Netdev List
* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Ira W. Snyder @ 2009-09-21 21:43 UTC
  To: Gregory Haskins
  Cc: Avi Kivity, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB1A8FD.2010805@gmail.com>

On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 09/16/2009 10:22 PM, Gregory Haskins wrote:
> >> Avi Kivity wrote:
> >>   
> >>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
> >>>     
> >>>>> If kvm can do it, others can.
> >>>>>
> >>>>>          
> >>>> The problem is that you seem to either hand-wave over details like
> >>>> this, or you give details that are pretty much exactly what vbus
> >>>> does already.  My point is that I've already sat down and thought
> >>>> about these issues and solved them in a freely available GPL'ed
> >>>> software package.
> >>>>
> >>>>        
> >>> In the kernel.  IMO that's the wrong place for it.
> >>>      
> >> 3) "in-kernel": You can do something like virtio-net to vhost to
> >> potentially meet some of the requirements, but not all.
> >>
> >> In order to fully meet (3), you would need to do some of that stuff you
> >> mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
> >> we need to have a facility for mapping eventfds and establishing a
> >> signaling mechanism (like PIO+qid), etc. KVM does this with
> >> IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
> >> invented.
> >>    
> > 
> > irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.
> 
> Not per se, but it needs to be interfaced.  How do I register that
> eventfd with the fastpath in Ira's rig? How do I signal the eventfd
> (x86->ppc, and ppc->x86)?
> 

Sorry to reply so late to this thread, I've been on vacation for the
past week. If you'd like to continue in another thread, please start it
and CC me.

On the PPC, I've got a hardware "doorbell" register which generates 30
distinguishable interrupts over the PCI bus. I have outbound and inbound
registers, which can be used to signal the "other side".
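With 30 distinguishable doorbell interrupts, one way to carry several IO streams (for example the separate queues of a network device) is to give each virtqueue its own doorbell bit and decode the inbound register in the interrupt handler. The userspace C sketch below shows only that decode step. The one-bit-per-virtqueue convention, the name doorbell_decode() and the register semantics are assumptions for illustration, not documented mpc83xx behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define DOORBELL_NUM_BITS 30	/* number of distinguishable doorbell interrupts */

/*
 * Hypothetical convention: bit n of the inbound doorbell register means
 * "virtqueue n has new work".  Decode one register read into a list of
 * queue numbers (lowest bit first) and return how many were found.
 * Bits at or above DOORBELL_NUM_BITS are ignored.
 */
static unsigned int doorbell_decode(uint32_t status, unsigned int *vqs,
				    unsigned int max_vqs)
{
	unsigned int n = 0;

	status &= (UINT32_C(1) << DOORBELL_NUM_BITS) - 1;
	while (status && n < max_vqs) {
		/* __builtin_ctz is a GCC/Clang builtin: index of lowest set bit */
		vqs[n++] = (unsigned int)__builtin_ctz(status);
		status &= status - 1;	/* clear the lowest set bit */
	}
	return n;
}
```

A status word with bits 0 and 5 set decodes to queues 0 and 5; the handler would then notify whatever is waiting on each of those queues.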

I assume it isn't too much code to signal an eventfd in an interrupt
handler. I haven't gotten to this point in the code yet.
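For reference, the in-kernel side of this would hold a struct eventfd_ctx and call eventfd_signal() on it from the handler. What makes that cheap is the eventfd counter: each signal adds to a 64-bit count, and one read returns the accumulated total and resets it, so several doorbell interrupts that arrive before the consumer runs collapse into one wakeup. A Linux-only userspace sketch using eventfd(2); doorbell_irq() and consume_events() are hypothetical names standing in for the interrupt handler and the consumer.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for the doorbell interrupt handler: add 1 to the eventfd
 * counter.  Returns 0 on success, -1 on error. */
static int doorbell_irq(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Stand-in for the consumer: a single read returns every signal that
 * accumulated since the previous read and resets the counter to zero.
 * Returns 0 if nothing was pending (EAGAIN on a non-blocking eventfd)
 * or on error. */
static uint64_t consume_events(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Three interrupts taken before the consumer runs produce a single read that returns 3.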

> To take it to the next level, how do I organize that mechanism so that
> it works for more than one IO-stream (e.g. address the various queues
> within ethernet or a different device like the console)?  KVM has
> IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
> have the luxury of an established IO paradigm.
> 
> Is vbus the only way to implement a solution?  No.  But it is _a_ way,
> and it's one that was specifically designed to solve this very problem
> (as well as others).
> 
> (As an aside, note that you generally will want an abstraction on top of
> irqfd/eventfd like shm-signal or virtqueues to do shared-memory based
> event mitigation, but I digress.  That is a separate topic).
> 
> > 
> >> To meet performance, this stuff has to be in kernel and there has to be
> >> a way to manage it.
> > 
> > and management belongs in userspace.
> 
> vbus does not dictate where the management must be.  It's an extensible
> framework, governed by what you plug into it (a la connectors and devices).
> 
> For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
> and DEVDROP hotswap events into the interrupt stream, because they are
> simple and we already needed the interrupt stream anyway for fast-path.
> 
> As another example: venet chose to put ->call(MACQUERY) "config-space"
> into its call namespace because it's simple, and we already need
> ->calls() for fastpath.  It therefore exports an attribute to sysfs that
> allows the management app to set it.
> 
> I could likewise have designed the connector or device-model differently
> as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
> userspace) but this seems silly to me when they are so trivial, so I didn't.
> 
> > 
> >> Since vbus was designed to do exactly that, this is
> >> what I would advocate.  You could also reinvent these concepts and put
> >> your own mux and mapping code in place, in addition to all the other
> >> stuff that vbus does.  But I am not clear why anyone would want to.
> >>    
> > 
> > Maybe they like their backward compatibility and Windows support.
> 
> This is really not relevant to this thread, since we are talking about
> Ira's hardware.  But if you must bring this up, then I will reiterate
> that you just design the connector to interface with QEMU+PCI and you
> have that too if that was important to you.
> 
> But on that topic: Since you could consider KVM a "motherboard
> manufacturer" of sorts (it just happens to be virtual hardware), I don't
> know why KVM seems to consider itself the only motherboard manufacturer
> in the world that has to make everything look legacy.  If a company like
> ASUS wants to add some cutting edge IO controller/bus, they simply do
> it.  Pretty much every product release may contain a different array of
> devices, many of which are not backwards compatible with any prior
> silicon.  The guy/gal installing Windows on that system may see a "?" in
> device-manager until they load a driver that supports the new chip, and
> subsequently it works.  It is certainly not a requirement to make said
> chip somehow work with existing drivers/facilities on bare metal, per
> se.  Why should virtual systems be different?
> 
> So, yeah, the current design of the vbus-kvm connector means I have to
> provide a driver.  This is understood, and I have no problem with that.
> 
> The only thing that I would agree has to be backwards compatible is the
> BIOS/boot function.  If you can't support running an image like the
> Windows installer, you are hosed.  If you can't use your ethernet until
> you get a chance to install a driver after the install completes, it's
> just like most other systems in existence.  IOW: It's not a big deal.
> 
> For cases where the IO system is needed as part of the boot/install, you
> provide BIOS and/or an install-disk support for it.
> 
> > 
> >> So no, the kernel is not the wrong place for it.  It's the _only_ place
> >> for it.  Otherwise, just use (1) and be done with it.
> >>
> >>    
> > 
> > I'm talking about the config stuff, not the data path.
> 
> As stated above, where config stuff lives is a function of what you
> interface to vbus.  Data-path stuff must be in the kernel for
> performance reasons, and this is what I was referring to.  I think we
> are generally both in agreement, here.
> 
> What I was getting at is that you can't just hand-wave the datapath
> stuff.  We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do
> device discovery/addressing with PCI.  Neither of those are available
> here in Ira's case yet the general concepts are needed.  Therefore, we
> have to come up with something else.
> 
> > 
> >>>   Further, if we adopt
> >>> vbus, we drop compatibility with existing guests or have to support both
> >>> vbus and virtio-pci.
> >>>      
> >> We already need to support both (at least to support Ira).  virtio-pci
> >> doesn't work here.  Something else (vbus, or vbus-like) is needed.
> >>    
> > 
> > virtio-ira.
> 
> Sure, virtio-ira and he is on his own to make a bus-model under that, or
> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
> model can work, I agree.
> 

Yes, I'm having to create my own bus model, a la lguest, virtio-pci, and
virtio-s390. It isn't especially easy. I can steal lots of code from the
lguest bus model, but sometimes it is good to generalize, especially
after the fourth implementation or so. I think this is what GHaskins tried
to do.


Here is what I've implemented so far:

* a generic virtio-phys-guest layer (my bus model, like lguest)
	- this runs on the crate server (x86) in my system
* a generic virtio-phys-host layer (my /dev/lguest implementation)
	- this runs on the ppc boards in my system
	- this assumes that the kernel will allocate some memory and
	  expose it over PCI in a device-specific way, so the guest can
	  see it as a PCI BAR
* a virtio-phys-mpc83xx driver
	- this runs on the crate server (x86) in my system
	- this interfaces virtio-phys-guest to my mpc83xx board
	- it is a Linux PCI driver, which detects mpc83xx boards, runs
	  pci_ioremap_bar() on the correct PCI BAR, and then gives that
	  to the virtio-phys-guest layer

I think that the idea of device/driver (instead of host/guest) is a good
one. It makes my problem easier to think about.

I've given it some thought, and I think that running vhost-net (or
similar) on the ppc boards, with virtio-net on the x86 crate server will
work. The virtio-ring abstraction is almost good enough to work for this
situation, but I had to re-invent it to work with my boards.

I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
Remember that this is the "host" system. I used each 4K block as a
"device descriptor" which contains:

1) the type of device, config space, etc. for virtio
2) the "desc" table (virtio memory descriptors, see virtio-ring)
3) the "avail" table (available entries in the desc table)

Parts 2 and 3 are repeated three times, to allow for a maximum of three
virtqueues per device. This is good enough for all current drivers.
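A rough C picture of one such 4K device descriptor, reusing the field layout of virtio's vring_desc and vring_avail. The field order, the 64-byte config area, the used_addr field for the guest-supplied PCI address and the 64-entry ring size are assumptions chosen so that three queues fit; the _Static_assert checks that the result stays within one 4K block of the 16K BAR.

```c
#include <assert.h>
#include <stdint.h>

#define VPHYS_BAR_SIZE    (16 * 1024)	/* BAR1 exposed by the ppc board */
#define VPHYS_BLOCK_SIZE  4096		/* one device descriptor per 4K block */
#define VPHYS_MAX_VQS     3		/* three virtqueues per device */
#define VPHYS_VQ_SIZE     64		/* hypothetical ring size */
#define VPHYS_CONFIG_LEN  64		/* hypothetical config-space size */

/* Same shape as virtio's struct vring_desc. */
struct vphys_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Same shape as virtio's struct vring_avail. */
struct vphys_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[VPHYS_VQ_SIZE];
};

struct vphys_vq {
	struct vphys_desc desc[VPHYS_VQ_SIZE];	/* part 2: descriptor table */
	struct vphys_avail avail;		/* part 3: available ring */
};

struct vphys_device_desc {
	uint32_t device_type;			/* part 1: virtio device ID ... */
	uint32_t num_vqs;
	uint64_t used_addr;			/* PCI address of guest-allocated used rings */
	uint8_t config[VPHYS_CONFIG_LEN];	/* ... and config space */
	struct vphys_vq vq[VPHYS_MAX_VQS];
};

_Static_assert(sizeof(struct vphys_device_desc) <= VPHYS_BLOCK_SIZE,
	       "device descriptor must fit in one 4K block");

/* Byte offset of device 'dev' within BAR1, or -1 if out of range. */
static long vphys_desc_offset(unsigned int dev)
{
	if (dev >= VPHYS_BAR_SIZE / VPHYS_BLOCK_SIZE)
		return -1;
	return (long)dev * VPHYS_BLOCK_SIZE;
}
```

With these sizes a descriptor takes 3560 bytes; doubling the ring to 128 entries would overflow the 4K block. A layout shared between the little-endian x86 side and the big-endian PowerPC side would also need an agreed byte order for every multi-byte field.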

The guest side (x86 in my system) allocates some device-accessible
memory, and writes the PCI address to the device descriptor. This memory
contains:

1) the "used" table (consumed entries in the desc/avail tables)

This exists three times as well, once for each virtqueue.

The rest is basically a copy of virtio-ring, with a few changes to allow
for caching, etc. It may not even be worth doing this from a
performance standpoint; I haven't benchmarked it yet.
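The guest-side used table, and the index arithmetic both sides share, can be sketched as follows. As in virtio-ring, idx is free-running and wraps at 65536, so the number of unconsumed entries is a 16-bit subtraction and the slot is idx modulo the ring size. The names and the 64-entry size are assumptions, and the sketch omits the memory barriers that real cross-bus code needs between reading idx and reading the ring entries.

```c
#include <assert.h>
#include <stdint.h>

#define VQ_SIZE 64	/* hypothetical ring size; must be a power of two */

/* Same shape as virtio's struct vring_used_elem / struct vring_used. */
struct vq_used_elem {
	uint32_t id;	/* head index of the consumed descriptor chain */
	uint32_t len;	/* bytes written into that buffer */
};

struct vq_used {
	uint16_t flags;
	uint16_t idx;	/* free-running; never reduced modulo VQ_SIZE */
	struct vq_used_elem ring[VQ_SIZE];
};

/* Entries published by the other side but not yet consumed.  Unsigned
 * 16-bit subtraction handles the wrap from 65535 back to 0. */
static uint16_t vq_pending(uint16_t published_idx, uint16_t last_seen_idx)
{
	return (uint16_t)(published_idx - last_seen_idx);
}

/* Map a free-running index to a slot in the ring. */
static unsigned int vq_slot(uint16_t idx)
{
	return idx & (VQ_SIZE - 1);
}

/* Consume every pending used entry; return the total bytes reported. */
static uint32_t vq_drain_used(const struct vq_used *used, uint16_t *last_seen)
{
	uint32_t bytes = 0;
	uint16_t n = vq_pending(used->idx, *last_seen);

	while (n--) {
		bytes += used->ring[vq_slot(*last_seen)].len;
		(*last_seen)++;
	}
	return bytes;
}
```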

For now, I'd be happy with a memcpy-only (non-DMA) solution. I can add DMA
once things are working.
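A memcpy-only transfer amounts to walking a descriptor chain and copying each segment. The sketch below treats descriptor addresses as offsets into a local buffer standing in for the remote memory window (in a driver that window would be the ioremapped BAR, read with memcpy_fromio() rather than memcpy()). copy_chain() and struct vring_desc_sketch are hypothetical; VRING_DESC_F_NEXT uses the same value (1) as virtio_ring.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VRING_DESC_F_NEXT 1	/* same value as in virtio_ring.h */

struct vring_desc_sketch {
	uint64_t addr;	/* here: offset into 'mem', standing in for a PCI address */
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/*
 * Gather the chain starting at 'head' into 'dst' with plain memcpy.
 * 'mem'/'mem_len' model the peer's memory window.  Returns the number of
 * bytes copied, or -1 for a bad index, a looping chain, an out-of-window
 * segment, or a 'dst' that is too small.
 */
static long copy_chain(const struct vring_desc_sketch *desc, unsigned int num,
		       uint16_t head, const uint8_t *mem, size_t mem_len,
		       uint8_t *dst, size_t dst_len)
{
	size_t done = 0;
	unsigned int hops = 0;
	uint16_t i = head;

	for (;;) {
		const struct vring_desc_sketch *d;

		if (i >= num || hops++ >= num)	/* bad index or loop in chain */
			return -1;
		d = &desc[i];
		if (d->addr > mem_len || d->len > mem_len - d->addr ||
		    d->len > dst_len - done)
			return -1;
		memcpy(dst + done, mem + d->addr, d->len);
		done += d->len;
		if (!(d->flags & VRING_DESC_F_NEXT))
			break;
		i = d->next;
	}
	return (long)done;
}
```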

I've got the current code (subject to change at any time) available at
the address listed below. If you think another format would be better
for you, please ask, and I'll provide it.
http://www.mmarray.org/~iws/virtio-phys/

I've gotten plenty of email about this from lots of interested
developers. There are people who would like this kind of system to just
work, while having to write just some glue for their device, just like a
network driver. I suspect most people have created some proprietary mess
that basically works, and left it at that.

So, here is a desperate cry for help. I'd like to make this work, and
I'd really like to see it in mainline. I'm trying to give back to the
community from which I've taken plenty.

Ira



* Re: [PATCH 3/3] at91_can: add driver for Atmel's CAN controller on AT91SAM9263
From: Andrew Victor @ 2009-09-21 21:44 UTC
  To: Marc Kleine-Budde
  Cc: netdev, linux-arm-kernel, Socketcan-core, Andrew Victor, wg
In-Reply-To: <1253180254-11910-4-git-send-email-mkl@pengutronix.de>

hi Marc,

> +static inline u32 at91_read(const struct at91_priv *priv, enum at91_reg reg)
> +{
> +       return readl(priv->reg_base + reg);
> +}
> +
> +static inline void at91_write(const struct at91_priv *priv, enum at91_reg reg,
> +               u32 value)
> +{
> +       writel(value, priv->reg_base + reg);
> +}

Use __raw_readl() and __raw_writel() instead.


Regards,
  Andrew Victor


* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-21 22:00 UTC
  To: Andreas Gruenbacher
  Cc: Eric Paris, Linus Torvalds, Evgeniy Polyakov, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <200909212327.20978.agruen@suse.de>

Andreas Gruenbacher wrote:
> On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote:
> > It would be logical if fanotify could block and ack those [mount & umount
> > events] in the same way as it can block and ack other accesses (with the
> > usual filtering rules on which inodes trigger events, and which don't or are
> > cached).
> 
> Hmm. To me, fanotify is about file contents first of all: this is what 
> fanotify wants to be able to veto.

Surely you don't assume that what constitutes malicious content is
independent of its location and/or name?

(See also "echo 'run_virus&' >>.bash_login".)

Wait a minute.  You don't assume that, otherwise why the interest in
subtrees? :-)

> Directory events seem reasonable to add for inotify compatibility,

Did you see my point about userspace caches and how directory events
are fundamental to that - there's no way to build a cache without them?

> but I see no need for access decisions on them. 

Please excuse me; I'm a bit confused.  Is fanotify intended just for
use by access decision programs, or is the plan now for it to also be
a replacement for inotify?  I'm getting conflicting signals about
that.

If it's just for access decision programs, and if those aren't going
to care about location, then there's no need to add directory events
to fanotify at all.  But then I'll be demanding subtree support in
inotify, please :-)

> Even less so for mounts and unmounts.

   (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat

If fanotify doesn't react to that, which is just a fancy way of saying
"zcat virus.gz >/bin/cat" in a way which doesn't cause any writes or
opens, what's the point in it?  Is fanotify only for checking files
written by non-root users?

> (Besides, we can't hold any vfs locks 
> while asking fanotify so those operations wouldn't be atomic, anyway.)

Indeed, good point.

-- Jamie


* Re: fanotify as syscalls
From: Davide Libenzi @ 2009-09-21 22:18 UTC
  To: Jamie Lokier
  Cc: Andreas Gruenbacher, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <20090921202823.GB14700@shareable.org>

On Mon, 21 Sep 2009, Jamie Lokier wrote:

> I think so too, and that'd be a great all round solution.

If this is for anti-malware vendors to intercept userspace accesses, and 
they're currently doing it by hacking the syscall table, why don't we 
offer a way to monitor syscalls (kernel side) in a non-racy way?
Modules can [un]register themselves for syscall interception, and receive 
the syscall number and parameters. They'd be able to change parameters, 
return error codes, and so on.
The cost of the check in the syscall path could even be put under 
alternative-like patching, if really needed.
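As a thought experiment, the registration scheme described above can be modeled in userspace C: hooks are called with the syscall number and arguments before the real handler, may rewrite the arguments, and may veto the call by returning a negative errno. syscall_hook_register(), dispatch(), the example hook and the fixed-size table are all hypothetical; nothing like this exists as a kernel interface, and a kernel version would need proper synchronization (for example RCU) around registration and unregistration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_HOOKS 8
#define NARGS     6

/* A hook sees the syscall number and may edit args[].  Returning a
 * nonzero value (a negative errno) vetoes the call with that error. */
typedef long (*syscall_hook_fn)(int nr, long args[NARGS]);

static syscall_hook_fn hooks[MAX_HOOKS];

static int syscall_hook_register(syscall_hook_fn fn)
{
	for (int i = 0; i < MAX_HOOKS; i++) {
		if (!hooks[i]) {
			hooks[i] = fn;
			return 0;
		}
	}
	return -ENOSPC;
}

static void syscall_hook_unregister(syscall_hook_fn fn)
{
	for (int i = 0; i < MAX_HOOKS; i++)
		if (hooks[i] == fn)
			hooks[i] = NULL;
}

/* Run every registered hook, then the real handler if nobody vetoed. */
static long dispatch(int nr, long args[NARGS], long (*real)(long args[NARGS]))
{
	for (int i = 0; i < MAX_HOOKS; i++) {
		if (hooks[i]) {
			long ret = hooks[i](nr, args);

			if (ret)
				return ret;
		}
	}
	return real(args);
}

/* Example hook: veto syscall number 42 (an arbitrary example number). */
static long example_deny_hook(int nr, long args[NARGS])
{
	(void)args;
	return nr == 42 ? -EPERM : 0;
}

/* Example "real" handler: just returns its first argument. */
static long example_handler(long args[NARGS])
{
	return args[0];
}
```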
The Pros of this would be:

- The kernel code to implement this would be trivially small, with no 
  I-need-this-feature-too growth potential

- There won't be any externally visible API to maintain (and its kernel 
  counterpart) and expand

- Any system call can be intercepted, allowing it to be flexible while 
  leaving the burden of the interception handling, and communication with 
  userspace policy enforcers, to the anti-malware (or whoever really) 
  companies' modules

The anti-malware vendors are already doing this (intercepting syscalls), 
they already have code for it, and they always did (writing kernel 
modules/drivers, that is) for Windows.

- Davide


* [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-21 22:47 UTC
  To: netdev
  Cc: Victor Gallardo, Feng Kan, lada.podivin, Loc Ho, Prodyut Hazarika,
	bhutchings, linuxppc-dev, davem

Support for hardware interrupt coalescing in MAL.
Coalescing is supported on the newer revs of 460EX/GT and 405EX.
The MAL driver falls back to the EOB IRQ if coalescing is not supported.
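The new Kconfig options expose a frame count and a timer (in clock ticks) per direction. A simplified model of the trade-off, assuming (the patch does not spell this out) that the coalescing interrupt fires either when the pending-frame count reaches the threshold or when the timer expires after the first pending frame; coal_irqs() is a hypothetical helper, not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified count-or-timer coalescing model.  arrivals[] holds frame
 * arrival times in clock ticks, sorted ascending; count must be >= 1.
 * Returns how many interrupts would be raised.
 */
static unsigned int coal_irqs(const uint64_t *arrivals, size_t n,
			      unsigned int count, uint64_t timer)
{
	unsigned int irqs = 0, pending = 0;
	uint64_t first = 0;

	for (size_t i = 0; i < n; i++) {
		/* Timer expired for the current batch before this frame arrived */
		if (pending && arrivals[i] - first >= timer) {
			irqs++;
			pending = 0;
		}
		if (pending++ == 0)
			first = arrivals[i];	/* start of a new batch */
		if (pending >= count) {		/* frame threshold reached */
			irqs++;
			pending = 0;
		}
	}
	if (pending)	/* the timer eventually flushes the tail */
		irqs++;
	return irqs;
}
```

With 40 back-to-back frames and the TX defaults (count 16, timer 1000000) the model raises 3 interrupts instead of 40; with the RX default count of 1 it raises one per frame, so receive is effectively not coalesced by count.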

Signed-off-by: Prodyut Hazarika <phazarika@amcc.com>
Acked-by: Victor Gallardo <vgallardo@amcc.com>
Acked-by: Feng Kan <fkan@amcc.com>

---
 drivers/net/ibm_newemac/Kconfig |   27 ++++
 drivers/net/ibm_newemac/mal.c   |  295 +++++++++++++++++++++++++++++++++++----
 drivers/net/ibm_newemac/mal.h   |   55 +++++++
 3 files changed, 350 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ibm_newemac/Kconfig b/drivers/net/ibm_newemac/Kconfig
index 78a1628..8a88173 100644
--- a/drivers/net/ibm_newemac/Kconfig
+++ b/drivers/net/ibm_newemac/Kconfig
@@ -63,6 +63,33 @@ config IBM_NEW_EMAC_EMAC4
 	bool
 	default n

+config IBM_NEW_EMAC_INTR_COALESCE
+	bool "Hardware Interrupt coalescing"
+	depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX)
+	default y
+	help
+	  Enable hardware interrupt coalescing for the Ethernet MAL.
+
+config IBM_NEW_EMAC_TX_COAL_COUNT
+	int "TX Coalescence frame count (packets)"
+	depends on IBM_NEW_EMAC_INTR_COALESCE
+	default "16"
+
+config IBM_NEW_EMAC_TX_COAL_TIMER
+	int "TX Coalescence timer (clock ticks)"
+	depends on IBM_NEW_EMAC_INTR_COALESCE
+	default "1000000"
+
+config IBM_NEW_EMAC_RX_COAL_COUNT
+	int "RX Coalescence frame count (packets)"
+	depends on IBM_NEW_EMAC_INTR_COALESCE
+	default "1"
+
+config IBM_NEW_EMAC_RX_COAL_TIMER
+	int "RX Coalescence timer (clock ticks)"
+	depends on IBM_NEW_EMAC_INTR_COALESCE
+	default "1000000"
+
 config IBM_NEW_EMAC_NO_FLOW_CTRL
 	bool
 	default n
diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c
index 2a2fc17..7dc06ad 100644
--- a/drivers/net/ibm_newemac/mal.c
+++ b/drivers/net/ibm_newemac/mal.c
@@ -31,6 +31,20 @@
 #include <asm/dcr-regs.h>

 static int mal_count;
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+static char *tx_coal_irqname[] = {
+	"TX0 COAL",
+	"TX1 COAL",
+	"TX2 COAL",
+	"TX3 COAL",
+};
+static char *rx_coal_irqname[] = {
+	"RX0 COAL",
+	"RX1 COAL",
+	"RX2 COAL",
+	"RX3 COAL",
+};
+#endif

 int __devinit mal_register_commac(struct mal_instance	*mal,
 				  struct mal_commac	*commac)
@@ -217,6 +231,86 @@ static inline void mal_disable_eob_irq(struct mal_instance *mal)
 	MAL_DBG2(mal, "disable_irq" NL);
 }

+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+static inline void mal_enable_coal(struct mal_instance *mal)
+{
+	unsigned int val;
+#if defined(CONFIG_405EX)
+	/* Clear the counters */
+	val = SDR0_ICC_FLUSH0 | SDR0_ICC_FLUSH1;
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
+
+	/* Set Tx/Rx Timer values */
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+
+	/* Enable the Tx/Rx Coalescing interrupt */
+	val = ((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR0_SHIFT) |
+		((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR1_SHIFT);
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
+
+	val = ((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR0_SHIFT) |
+		((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR1_SHIFT);
+
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
+#elif defined(CONFIG_460EX) || defined(CONFIG_460GT)
+	/* Clear the counters */
+	val = SDR0_ICC_FLUSH;
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
+#if defined(CONFIG_460GT)
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
+#endif
+
+	/* Set Tx/Rx Timer values */
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+#if defined(CONFIG_460GT)
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX2, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRTX3, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX2, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+	mtdcri(SDR0, DCRN_SDR0_ICCTRRX3, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
+#endif
+
+	/* Enable the Tx/Rx Coalescing interrupt */
+	val = (CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR_SHIFT;
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
+#if defined(CONFIG_460GT)
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
+#endif
+
+	val = (CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
+			<< SDR0_ICC_FTHR_SHIFT;
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
+#if defined(CONFIG_460GT)
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
+	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
+#endif
+#endif
+	printk(KERN_INFO "MAL: Enabled Intr Coal TxCnt: %d RxCnt: %d\n",
+		CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT,
+		CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT);
+}
+#endif
+
 static irqreturn_t mal_serr(int irq, void *dev_instance)
 {
 	struct mal_instance *mal = dev_instance;
@@ -309,6 +403,15 @@ static irqreturn_t mal_rxeob(int irq, void *dev_instance)
 	return IRQ_HANDLED;
 }

+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+static irqreturn_t mal_coal(int irq, void *dev_instance)
+{
+	struct mal_instance *mal = dev_instance;
+	mal_schedule_poll(mal);
+	return IRQ_HANDLED;
+}
+#endif
+
 static irqreturn_t mal_txde(int irq, void *dev_instance)
 {
 	struct mal_instance *mal = dev_instance;
@@ -527,6 +630,10 @@ static int __devinit mal_probe(struct of_device *ofdev,
 	u32 cfg;
 	unsigned long irqflags;
 	irq_handler_t hdlr_serr, hdlr_txde, hdlr_rxde;
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	int num_phys_chans;
+	int coal_intr_index;
+#endif

 	mal = kzalloc(sizeof(struct mal_instance), GFP_KERNEL);
 	if (!mal) {
@@ -609,6 +716,50 @@ static int __devinit mal_probe(struct of_device *ofdev,
 		goto fail_unmap;
 	}

+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	/* Number of Tx channels is equal to Physical channels */
+	/* Rx channels include Virtual channels so use Tx channels */
+	BUG_ON(mal->num_tx_chans > MAL_MAX_PHYS_CHANNELS);
+	num_phys_chans = mal->num_tx_chans;
+	/* Older revs in 460EX and 460GT have coalesce bug in h/w */
+#if defined(CONFIG_460EX) || defined(CONFIG_460GT)
+	{
+		unsigned int pvr;
+		unsigned short min;
+		pvr = mfspr(SPRN_PVR);
+		min = PVR_MIN(pvr);
+		if (min < 4) {
+			printk(KERN_INFO "PVR %x Intr Coal disabled: H/W bug\n",
+					pvr);
+			mal->coalesce_disabled = 1;
+		}
+	}
+#else
+	mal->coalesce_disabled = 0;
+#endif
+	coal_intr_index = 5;
+
+	/* If the device tree lacks the coalescing IRQs, fall back to EOB IRQ */
+	for (i = 0; (i < num_phys_chans) && !mal->coalesce_disabled; i++) {
+		mal->txcoal_irq[i] =
+			irq_of_parse_and_map(ofdev->node, coal_intr_index++);
+		if (mal->txcoal_irq[i] == NO_IRQ) {
+			printk(KERN_INFO "MAL: No device tree IRQ "
+				"for TxCoal%d - disabling coalescing\n", i);
+			mal->coalesce_disabled = 1;
+		}
+	}
+	for (i = 0; (i < num_phys_chans) && !mal->coalesce_disabled ; i++) {
+		mal->rxcoal_irq[i] =
+			irq_of_parse_and_map(ofdev->node, coal_intr_index++);
+		if (mal->rxcoal_irq[i] == NO_IRQ) {
+			printk(KERN_INFO "MAL: No device tree IRQ "
+				"for RxCoal%d - disabling coalescing\n", i);
+			mal->coalesce_disabled = 1;
+		}
+	}
+#endif
+
 	INIT_LIST_HEAD(&mal->poll_list);
 	INIT_LIST_HEAD(&mal->list);
 	spin_lock_init(&mal->lock);
@@ -674,20 +825,69 @@ static int __devinit mal_probe(struct of_device *ofdev,
 	}

 	err = request_irq(mal->serr_irq, hdlr_serr, irqflags, "MAL SERR", mal);
-	if (err)
-		goto fail2;
+	if (err) {
+		mal->serr_irq = NO_IRQ;
+		goto failirq;
+	}
 	err = request_irq(mal->txde_irq, hdlr_txde, irqflags, "MAL TX DE", mal);
-	if (err)
-		goto fail3;
-	err = request_irq(mal->txeob_irq, mal_txeob, 0, "MAL TX EOB", mal);
-	if (err)
-		goto fail4;
+	if (err) {
+		mal->txde_irq = NO_IRQ;
+		goto failirq;
+	}
 	err = request_irq(mal->rxde_irq, hdlr_rxde, irqflags, "MAL RX DE", mal);
-	if (err)
-		goto fail5;
-	err = request_irq(mal->rxeob_irq, mal_rxeob, 0, "MAL RX EOB", mal);
-	if (err)
-		goto fail6;
+	if (err) {
+		mal->rxde_irq = NO_IRQ;
+		goto failirq;
+	}
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	for (i = 0; (i < num_phys_chans) && !mal->coalesce_disabled; i++) {
+		err = request_irq(mal->txcoal_irq[i],
+					mal_coal, 0, tx_coal_irqname[i], mal);
+		if (err) {
+			printk(KERN_INFO "MAL: TxCoal%d ReqIRQ failed"
+					" - disabling coalescing\n", i);
+			mal->txcoal_irq[i] = NO_IRQ;
+			mal->coalesce_disabled = 1;
+			break;
+		}
+	}
+	for (i = 0; (i < num_phys_chans) && !mal->coalesce_disabled; i++) {
+		err = request_irq(mal->rxcoal_irq[i],
+					mal_coal, 0, rx_coal_irqname[i], mal);
+		if (err) {
+			printk(KERN_INFO "MAL: RxCoal%d ReqIRQ failed - "
+					"disabling coalescing\n", i);
+			mal->rxcoal_irq[i] = NO_IRQ;
+			mal->coalesce_disabled = 1;
+			break;
+		}
+	}
+
+	/* Fall back to EOB IRQ if coalesce not supported */
+	if (mal->coalesce_disabled) {
+		/* Clean up any IRQs allocated for Coalescing */
+		for (i = 0; i < num_phys_chans; i++) {
+			if (mal->txcoal_irq[i] != NO_IRQ)
+				free_irq(mal->txcoal_irq[i], mal);
+			if (mal->rxcoal_irq[i] != NO_IRQ)
+				free_irq(mal->rxcoal_irq[i], mal);
+		}
+#endif
+		err = request_irq(mal->txeob_irq, mal_txeob, 0,
+					"MAL TX EOB", mal);
+		if (err) {
+			mal->txeob_irq = NO_IRQ;
+			goto failirq;
+		}
+		err = request_irq(mal->rxeob_irq, mal_rxeob, 0,
+					"MAL RX EOB", mal);
+		if (err) {
+			mal->rxeob_irq = NO_IRQ;
+			goto failirq;
+		}
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	}
+#endif

 	/* Enable all MAL SERR interrupt sources */
 	if (mal->version == 2)
@@ -695,6 +895,10 @@ static int __devinit mal_probe(struct of_device *ofdev,
 	else
 		set_mal_dcrn(mal, MAL_IER, MAL1_IER_EVENTS);

+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	if (mal->coalesce_disabled == 0)
+		mal_enable_coal(mal);
+#endif
 	/* Enable EOB interrupt */
 	mal_enable_eob_irq(mal);

@@ -711,15 +915,30 @@ static int __devinit mal_probe(struct of_device *ofdev,

 	return 0;

- fail6:
-	free_irq(mal->rxde_irq, mal);
- fail5:
-	free_irq(mal->txeob_irq, mal);
- fail4:
-	free_irq(mal->txde_irq, mal);
- fail3:
-	free_irq(mal->serr_irq, mal);
- fail2:
+ failirq:
+	if (mal->serr_irq != NO_IRQ)
+		free_irq(mal->serr_irq, mal);
+	if (mal->txde_irq != NO_IRQ)
+		free_irq(mal->txde_irq, mal);
+	if (mal->rxde_irq != NO_IRQ)
+		free_irq(mal->rxde_irq, mal);
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	if (mal->coalesce_disabled == 0) {
+		for (i = 0; i < num_phys_chans; i++) {
+			if (mal->txcoal_irq[i] != NO_IRQ)
+				free_irq(mal->txcoal_irq[i], mal);
+			if (mal->rxcoal_irq[i] != NO_IRQ)
+				free_irq(mal->rxcoal_irq[i], mal);
+		}
+	} else {
+#endif
+		if (mal->txeob_irq != NO_IRQ)
+			free_irq(mal->txeob_irq, mal);
+		if (mal->rxeob_irq != NO_IRQ)
+			free_irq(mal->rxeob_irq, mal);
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	}
+#endif
 	dma_free_coherent(&ofdev->dev, bd_size, mal->bd_virt, mal->bd_dma);
  fail_unmap:
 	dcr_unmap(mal->dcr_host, 0x100);
@@ -732,6 +951,10 @@ static int __devinit mal_probe(struct of_device *ofdev,
 static int __devexit mal_remove(struct of_device *ofdev)
 {
 	struct mal_instance *mal = dev_get_drvdata(&ofdev->dev);
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	int	i;
+	int	num_phys_chans;
+#endif

 	MAL_DBG(mal, "remove" NL);

@@ -748,12 +971,30 @@ static int __devexit mal_remove(struct of_device *ofdev)

 	dev_set_drvdata(&ofdev->dev, NULL);

-	free_irq(mal->serr_irq, mal);
-	free_irq(mal->txde_irq, mal);
-	free_irq(mal->txeob_irq, mal);
-	free_irq(mal->rxde_irq, mal);
-	free_irq(mal->rxeob_irq, mal);
-
+	if (mal->serr_irq != NO_IRQ)
+		free_irq(mal->serr_irq, mal);
+	if (mal->txde_irq != NO_IRQ)
+		free_irq(mal->txde_irq, mal);
+	if (mal->rxde_irq != NO_IRQ)
+		free_irq(mal->rxde_irq, mal);
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	num_phys_chans = mal->num_tx_chans;
+	if (mal->coalesce_disabled == 0) {
+		for (i = 0; i < num_phys_chans; i++) {
+			if (mal->txcoal_irq[i] != NO_IRQ)
+				free_irq(mal->txcoal_irq[i], mal);
+			if (mal->rxcoal_irq[i] != NO_IRQ)
+				free_irq(mal->rxcoal_irq[i], mal);
+		}
+	} else {
+#endif
+		if (mal->txeob_irq != NO_IRQ)
+			free_irq(mal->txeob_irq, mal);
+		if (mal->rxeob_irq != NO_IRQ)
+			free_irq(mal->rxeob_irq, mal);
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	}
+#endif
 	mal_reset(mal);

 	mal_dbg_unregister(mal);
diff --git a/drivers/net/ibm_newemac/mal.h b/drivers/net/ibm_newemac/mal.h
index 9ededfb..a93c352 100644
--- a/drivers/net/ibm_newemac/mal.h
+++ b/drivers/net/ibm_newemac/mal.h
@@ -169,6 +169,56 @@ struct mal_descriptor {
 #define MAL_TX_CTRL_LAST	0x1000
 #define MAL_TX_CTRL_INTR	0x0400

+#if defined(CONFIG_405EX)
+#define DCRN_SDR0_ICCRTX	0x430B /* Int coal Tx control register */
+#define DCRN_SDR0_ICCRRX	0x430C /* Int coal Rx control register */
+#define SDR0_ICC_FTHR0_SHIFT	23
+#define SDR0_ICC_FLUSH0		(1 << 22)
+#define SDR0_ICC_FLUWI0		(1 << 21)
+#define SDR0_ICC_FTHR1_SHIFT	12
+#define SDR0_ICC_FLUSH1		(1 << 11)
+#define SDR0_ICC_FLUWI1		(1 << 10)
+#define DCRN_SDR0_ICCTRTX0	0x430D /* Int coal Tx0 count threshold */
+#define DCRN_SDR0_ICCTRTX1	0x430E /* Int coal Tx1 count threshold */
+#define DCRN_SDR0_ICCTRRX0	0x430F /* Int coal Rx0 count threshold */
+#define DCRN_SDR0_ICCTRRX1	0x4310 /* Int coal Rx1 count threshold */
+#define DCRN_SDR0_ICTSRTX0	0x4307 /* Int coal Tx0 timer status*/
+#define DCRN_SDR0_ICTSRTX1	0x4308 /* Int coal Tx1 timer status*/
+#define DCRN_SDR0_ICTSRRX0	0x4309 /* Int coal Rx0 timer status*/
+#define DCRN_SDR0_ICTSRRX1	0x430A /* Int coal Rx1 timer status*/
+#elif defined(CONFIG_460EX) || defined(CONFIG_460GT)
+#define DCRN_SDR0_ICCRTX0	0x4410 /* Int coal Tx0 control register */
+#define DCRN_SDR0_ICCRTX1	0x4411 /* Int coal Tx1 control register */
+#define DCRN_SDR0_ICCRTX2	0x4412 /* Int coal Tx2 control register */
+#define DCRN_SDR0_ICCRTX3	0x4413 /* Int coal Tx3 control register */
+#define DCRN_SDR0_ICCRRX0	0x4414 /* Int coal Rx0 control register */
+#define DCRN_SDR0_ICCRRX1	0x4415 /* Int coal Rx1 control register */
+#define DCRN_SDR0_ICCRRX2	0x4416 /* Int coal Rx2 control register */
+#define DCRN_SDR0_ICCRRX3	0x4417 /* Int coal Rx3 control register */
+#define SDR0_ICC_FTHR_SHIFT	23
+#define SDR0_ICC_FLUSH		(1 << 22)
+#define SDR0_ICC_FLUWI		(1 << 21)
+#define DCRN_SDR0_ICCTRTX0	0x4418 /* Int coal Tx0 count threshold */
+#define DCRN_SDR0_ICCTRTX1	0x4419 /* Int coal Tx1 count threshold */
+#define DCRN_SDR0_ICCTRTX2	0x441A /* Int coal Tx2 count threshold */
+#define DCRN_SDR0_ICCTRTX3	0x441B /* Int coal Tx3 count threshold */
+#define DCRN_SDR0_ICCTRRX0	0x441C /* Int coal Rx0 count threshold */
+#define DCRN_SDR0_ICCTRRX1	0x441D /* Int coal Rx1 count threshold */
+#define DCRN_SDR0_ICCTRRX2	0x441E /* Int coal Rx2 count threshold */
+#define DCRN_SDR0_ICCTRRX3	0x441F /* Int coal Rx3 count threshold */
+#define DCRN_SDR0_ICTSRTX0	0x4420 /* Int coal Tx0 timer status*/
+#define DCRN_SDR0_ICTSRTX1	0x4421 /* Int coal Tx1 timer status*/
+#define DCRN_SDR0_ICTSRTX2	0x4422 /* Int coal Tx2 timer status*/
+#define DCRN_SDR0_ICTSRTX3	0x4423 /* Int coal Tx3 timer status*/
+#define DCRN_SDR0_ICTSRRX0	0x4424 /* Int coal Rx0 timer status*/
+#define DCRN_SDR0_ICTSRRX1	0x4425 /* Int coal Rx1 timer status*/
+#define DCRN_SDR0_ICTSRRX2	0x4426 /* Int coal Rx2 timer status*/
+#define DCRN_SDR0_ICTSRRX3	0x4427 /* Int coal Rx3 timer status*/
+#endif
+
+#define COAL_FRAME_MASK		0x1FF
+#define MAL_MAX_PHYS_CHANNELS	4
+
 struct mal_commac_ops {
 	void	(*poll_tx) (void *dev);
 	int	(*poll_rx) (void *dev, int budget);
@@ -217,6 +267,11 @@ struct mal_instance {
 	struct net_device	dummy_dev;

 	unsigned int features;
+#ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
+	int			txcoal_irq[MAL_MAX_PHYS_CHANNELS];
+	int			rxcoal_irq[MAL_MAX_PHYS_CHANNELS];
+	int			coalesce_disabled;
+#endif
 };

 static inline u32 get_mal_dcrn(struct mal_instance *mal, int reg)
-- 
1.5.6
--------------------------------------------------------

CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is 
for the sole use of the intended recipient(s) and contains information 
that is confidential and proprietary to AppliedMicro Corporation or its 
subsidiaries. It is to be used solely for the purpose of furthering the 
parties' business relationship. All unauthorized review, use, disclosure 
or distribution is prohibited. If you are not the intended recipient, 
please contact the sender by reply e-mail and destroy all copies of the 
original message.

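For reference, the 460EX/GT path in the patch builds each ICCRTXn/ICCRRXn value as (count & COAL_FRAME_MASK) << SDR0_ICC_FTHR_SHIFT. The stand-alone sketch below reproduces that computation. The function name is hypothetical, and the two constants are copied from the header hunk above. It also shows one consequence of masking: frame counts above 511 wrap silently instead of being rejected, so a configured count of 600 programs a threshold of 88.

```c
#include <stdint.h>

#define COAL_FRAME_MASK		0x1FFu	/* 9-bit frame-count threshold field */
#define SDR0_ICC_FTHR_SHIFT	23	/* field position used on 460EX/GT */

/* Hypothetical helper: value written to an ICCRTXn/ICCRRXn register. */
static uint32_t icc_frame_threshold(uint32_t frames)
{
	/* Out-of-range counts are masked (wrapped), not clamped or rejected. */
	return (frames & COAL_FRAME_MASK) << SDR0_ICC_FTHR_SHIFT;
}
```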

* [PATCH 2/2] ibm_newemac: MAL Coalescing in Canyonlands/Kilauea/Glacier dts
From: Prodyut Hazarika @ 2009-09-21 22:47 UTC (permalink / raw)
  To: netdev
  Cc: Victor Gallardo, Feng Kan, lada.podivin, Loc Ho, Prodyut Hazarika,
	bhutchings, linuxppc-dev, davem

Support for MAL interrupt coalescing in the Canyonlands, Kilauea & Glacier dts.
The MAL driver falls back to the EOB IRQ if the coalescing IRQ mapping is
missing from the dts.

Signed-off-by: Prodyut Hazarika <phazarika@amcc.com>
Acked-by: Victor Gallardo <vgallardo@amcc.com>
Acked-by: Feng Kan <fkan@amcc.com>

---
 arch/powerpc/boot/dts/canyonlands.dts |    6 +++++-
 arch/powerpc/boot/dts/glacier.dts     |   10 +++++++++-
 arch/powerpc/boot/dts/kilauea.dts     |    8 ++++++--
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/dts/canyonlands.dts b/arch/powerpc/boot/dts/canyonlands.dts
index c920170..5803a5b 100644
--- a/arch/powerpc/boot/dts/canyonlands.dts
+++ b/arch/powerpc/boot/dts/canyonlands.dts
@@ -146,7 +146,11 @@
 					/*RXEOB*/ 0x7 0x4
 					/*SERR*/  0x3 0x4
 					/*TXDE*/  0x4 0x4
-					/*RXDE*/  0x5 0x4>;
+					/*RXDE*/  0x5 0x4
+					/*TX0 COAL*/  0x8 0x2
+					/*TX1 COAL*/  0x9 0x2
+					/*RX0 COAL*/  0xc 0x2
+					/*RX1 COAL*/  0xd 0x2 >;
 		};

 		USB0: ehci@bffd0400 {
diff --git a/arch/powerpc/boot/dts/glacier.dts b/arch/powerpc/boot/dts/glacier.dts
index f3787a2..9af473f 100644
--- a/arch/powerpc/boot/dts/glacier.dts
+++ b/arch/powerpc/boot/dts/glacier.dts
@@ -130,7 +130,15 @@
 					/*RXEOB*/ 0x7 0x4
 					/*SERR*/  0x3 0x4
 					/*TXDE*/  0x4 0x4
-					/*RXDE*/  0x5 0x4>;
+					/*RXDE*/  0x5 0x4
+					/*TX0 COAL*/  0x8 0x2
+					/*TX1 COAL*/  0x9 0x2
+					/*TX2 COAL*/  0xa 0x2
+					/*TX3 COAL*/  0xb 0x2
+					/*RX0 COAL*/  0xc 0x2
+					/*RX1 COAL*/  0xd 0x2
+					/*RX2 COAL*/  0xe 0x2
+					/*RX3 COAL*/  0xf 0x2 >;
 			desc-base-addr-high = <0x8>;
 		};

diff --git a/arch/powerpc/boot/dts/kilauea.dts b/arch/powerpc/boot/dts/kilauea.dts
index c465614..14057a2 100644
--- a/arch/powerpc/boot/dts/kilauea.dts
+++ b/arch/powerpc/boot/dts/kilauea.dts
@@ -110,7 +110,7 @@
 			num-tx-chans = <2>;
 			num-rx-chans = <2>;
 			interrupt-parent = <&MAL0>;
-			interrupts = <0x0 0x1 0x2 0x3 0x4>;
+			interrupts = <0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8>;
 			#interrupt-cells = <1>;
 			#address-cells = <0>;
 			#size-cells = <0>;
@@ -118,7 +118,11 @@
 					/*RXEOB*/ 0x1 &UIC0 0xb 0x4
 					/*SERR*/  0x2 &UIC1 0x0 0x4
 					/*TXDE*/  0x3 &UIC1 0x1 0x4
-					/*RXDE*/  0x4 &UIC1 0x2 0x4>;
+					/*RXDE*/  0x4 &UIC1 0x2 0x4
+					/*TX0 COAL*/  0x5 &UIC2 0x7 0x2
+					/*TX1 COAL*/  0x6 &UIC2 0x8 0x2
+					/*RX0 COAL*/  0x7 &UIC2 0x9 0x2
+					/*RX1 COAL*/  0x8 &UIC2 0xa 0x2 >;
 			interrupt-map-mask = <0xffffffff>;
 		};

-- 
1.5.6

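One plausible reading of the fallback described in this commit message is sketched below. It assumes that coalescing is enabled or disabled as a whole (which the coalesce_disabled field added to struct mal_instance suggests), that the per-channel IRQs are stored in arrays like txcoal_irq[]/rxcoal_irq[], and that, as with irq_of_parse_and_map(), a value of 0 means "no mapping". The helper name is hypothetical and is not taken from the patch.

```c
#include <stdbool.h>

#define MAL_MAX_PHYS_CHANNELS	4

/*
 * Hypothetical helper: coalescing is usable only if every channel the
 * board uses has a mapped coalescing IRQ; otherwise the driver would
 * keep using the EOB interrupts.
 */
static bool mal_coal_usable(const int coal_irq[MAL_MAX_PHYS_CHANNELS],
			    int nr_chans)
{
	int i;

	if (nr_chans <= 0 || nr_chans > MAL_MAX_PHYS_CHANNELS)
		return false;
	for (i = 0; i < nr_chans; i++)
		if (coal_irq[i] == 0)	/* mapping missing from the dts */
			return false;
	return true;
}
```

With the Canyonlands map above (two TX and two RX coalescing entries), a two-channel check would pass, while a four-channel check against the same table would fall back to EOB.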

* Re: fanotify as syscalls
From: Andreas Gruenbacher @ 2009-09-21 23:09 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, Linus Torvalds, Evgeniy Polyakov, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <20090921220002.GE14700@shareable.org>

On Tuesday, 22 September 2009 0:00:02 Jamie Lokier wrote:
> Andreas Gruenbacher wrote:
> > On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote:
> > > It would be logical if fanotify could block and ack those [mount &
> > > umount events] in the same way as it can block and ack other accesses
> > > (with the usual filtering rules on which inodes trigger events, and
> > > which don't or are cached).
> >
> > Hmm. To me, fanotify is about file contents first of all: this is what
> > fanotify wants to be able to veto.
>
> Surely you don't assume that what constitutes malicious content is
> independent of its location and/or name?

If the antimalware vendors want to base their decisions on pathnames then 
that's their decision, and they can check /proc/self/fd/N. We should be able 
to treat directory events the same.

> (See also "echo 'run_virus&' >>.bash_login).
>
> Wait a minute.  You don't assume that, otherwise why the interest in
> subtrees? :-)
>
> > Directory events seem reasonable to add for inotify compatibility,
>
> Did you see my point about userspace caches and how directory events
> are fundamental to that - there's no way to build a cache without them?

Yes, there were some doubts about this approach. Waiting for your code to
demonstrate; an object based cache (e.g., st_dev + st_ino) rather than a 
pathname based cache would seem more reasonable.
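An object-based cache key of the kind suggested here can be sketched in user space as below; the type and function names are hypothetical. Entries are keyed on the (st_dev, st_ino) pair reported by stat(), so different paths that reach the same inode, such as "/" and "/.", or a bind mount and its source, share one entry. That is the property an object-based cache relies on, and it is also the property that matters if a cached decision is supposed to depend on location.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical cache key identifying a file object, not a path. */
struct obj_key {
	dev_t dev;
	ino_t ino;
};

/* Fill *key for path (stat() follows symlinks); false if stat() fails. */
static bool obj_key_for_path(const char *path, struct obj_key *key)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;
	key->dev = st.st_dev;
	key->ino = st.st_ino;
	return true;
}

static bool obj_key_equal(const struct obj_key *a, const struct obj_key *b)
{
	return a->dev == b->dev && a->ino == b->ino;
}
```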

> > but I see no need for access decisions on them.
>
> Please excuse me; I'm a bit confused.  Is fanotify intended just for
> use by access decision programs, or is the plan now for it to also be
> a replacement for inotify?  I'm getting conflicting signals about
> that.

Inotify doesn't support access decisions. So where's the problem with 
having "notify only" events for directory / mount / unmount events?

> If it's just for access decision programs, and if those aren't going
> to care about location, then there's no need to add directory events
> to fanotify at all.  But then I'll be demanding subtree support in
> inotify, please :-)
>
> > Even less so for mounts and unmounts.
>
>    (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat

... and then someone accesses /bin/cat, which triggers a fanotify access 
decision.

Thanks,
Andreas


* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-21 23:12 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andreas Gruenbacher, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <alpine.DEB.2.00.0909211456180.1116@makko.or.mcafeemobile.com>

Davide Libenzi wrote:
> On Mon, 21 Sep 2009, Jamie Lokier wrote:
> 
> > I think so too, and that'd be a great all-round solution.
> 
> If this is for anti-malware vendors

Personally I'm not interested in anti-malware, and am simply
interested in leveraging fsnotify improvements to accelerate userspace
caches of information which depends on files (indexes, templates,
compiler caches, stat caches etc.).  Basically make inotify better,
and sufficiently correct for that purpose.

My sticking my oar in lately is to ensure the fsnotify improvements
are going in the (imho) right direction.  There's a lot of interesting
apps waiting in the wings on this.  It doesn't have to be complicated,
just... sensible.

> to intercept userspace accesses 
> they're currently doing it by hacking the syscall table, why don't we 
> offer a way to monitor syscalls (kernel side) in a non racy way?
> Modules can [un]register themselves for syscall interception, and receive
> the syscall number and parameters. They'd be able to change parameters,
> return error codes, and so on.
> The cost of the check in the syscall path could even be under an
> alternative-like patching, if really needed.
> The Pros of this would be:
> 
> - The kernel code to implement this would be trivially small, with no 
>   I-need-this-feature-too growth potential

(Fwiw, the {fa,fs,i}notify thing looks to me like it's getting simpler
as we go.  Good design = decrease complexity + increase versatility.
E.g. see epoll.)

> - There won't be any externally visible API to maintain (and its kernel 
>   counter part) and expand
> 
> - Any system call can be intercepted, allowing it to be flexible while 
>   leaving the burden of the interception handling, and communication with 
>   userspace policy enforcers, to the anti-malware (or whoever really) 
>   companies modules
> 
> The anti-malware are already doing this (intercepting syscall), they 
> already have code for it, and they always did (writing kernel 
> modules/drivers, that is) for Windows.

I don't mind at all if fanotify is replaced by a general purpose "take
over the system call table" solution for anti-malware, and I still get
to keep the fsnotify improvements :-)

But I can't help noticing that we _already_ have quite well placed
hooks for intercepting system calls, called security_this and
security_that (SELinux etc), albeit they can't redirect things so much.

However, being a little kinder, I suspect even the anti-malware
vendors would rather not slow down everything with race-prone
complicated tracking of everything every process does...  which is why
fanotify allows its "interest set" to be reduced from everything to a
subset of files, and its results to be cached, and let the races be
handled in the normal way by VFS.

Once you have an "interest set" and focus on files, it looks somewhat
reasonable to use the fsnotify hooks.

...That is, if you believe monitoring files is the best approach to
anti-malware.  I can't help noticing that on (ahem) Windows, running
just a "virus checker" which generically scans every file independent
of its location looking for signatures and keeping up with patches is
no longer considered good enough.

-- Jamie



* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-21 23:41 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253573245-1867-1-git-send-email-phazarika@amcc.com>

On Mon, 2009-09-21 at 15:47 -0700, Prodyut Hazarika wrote:
> Support for Hardware Interrupt coalescing in MAL.
> Coalescing is supported on the newer revs of 460EX/GT and 405EX.
> The MAL driver falls back to EOB IRQ if coalescing not supported
> 
> Signed-off-by: Prodyut Hazarika <phazarika@amcc.com>
> Acked-by: Victor Gallardo <vgallardo@amcc.com>
> Acked-by: Feng Kan <fkan@amcc.com>

There's an awful lot of ifdef based on the CPU type in there. This is
not right.

What happens if we build a kernel that is supposed to boot with two
different variants of 405 or 440 ?

All of this should be runtime features.

ie:

> #ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
> +static inline void mal_enable_coal(struct mal_instance *mal)
> +{
> +	unsigned int val;
> +#if defined(CONFIG_405EX)
> +	/* Clear the counters */
> +	val = SDR0_ICC_FLUSH0 | SDR0_ICC_FLUSH1;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
> +
> +	/* Set Tx/Rx Timer values */
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +
> +	/* Enable the Tx/Rx Coalescing interrupt */
> +	val = ((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR0_SHIFT) |
> +		((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR1_SHIFT);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
> +
> +	val = ((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR0_SHIFT) |
> +		((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR1_SHIFT);
> +
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
> +#elif defined(CONFIG_460EX) || defined(CONFIG_460GT)
> +	/* Clear the counters */
> +	val = SDR0_ICC_FLUSH;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
> +#endif
> +
> +	/* Set Tx/Rx Timer values */
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX2, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX3, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX2, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX3, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +#endif
> +
> +	/* Enable the Tx/Rx Coalescing interrupt */
> +	val = (CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR_SHIFT;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
> +#endif
> +
> +	val = (CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR_SHIFT;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
> +#endif
> +#endif
> +	printk(KERN_INFO "MAL: Enabled Intr Coal TxCnt: %d RxCnt: %d\n",
> +		CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT,
> +		CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT);
> +}
> +#endif

This is all quite wrong. Either use MAL features or some other runtime
check, possibly via the "compatible" property.

Same goes with the SDR register definitions. Prefix them with the SOC
name but don't make them conditionally compiled. This is all back to the
same mess we had in arch/ppc and I'm not going to accept it.
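A runtime check of the kind suggested here could be table-driven, keyed on the MAL node's "compatible" property instead of CONFIG_405EX/CONFIG_460GT. The sketch below is stand-alone C. The compatible strings are assumptions about what these boards' device trees use, and the channel counts are taken from the dts patch (two coalescing channels each way on 405EX and 460EX, four on 460GT). In the kernel the lookup would more naturally go through of_device_is_compatible() or an of_device_id match table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-SoC description, selected at runtime instead of #ifdef. */
struct mal_coal_variant {
	const char *compatible;	/* assumed MAL "compatible" string */
	int nr_coal_chans;	/* coalescing-capable Tx (and Rx) channels */
};

static const struct mal_coal_variant mal_coal_variants[] = {
	{ "ibm,mcmal-405ex", 2 },
	{ "ibm,mcmal-460ex", 2 },
	{ "ibm,mcmal-460gt", 4 },
};

/* Number of coalescing channels for this MAL, or 0 if it has none. */
static int mal_coal_channels(const char *compatible)
{
	size_t i;

	for (i = 0; i < sizeof(mal_coal_variants) / sizeof(mal_coal_variants[0]); i++)
		if (strcmp(compatible, mal_coal_variants[i].compatible) == 0)
			return mal_coal_variants[i].nr_coal_chans;
	return 0;	/* e.g. an older 440EP MAL: keep using EOB interrupts */
}
```

One image built this way can run on all three SoCs and on older parts; only the table grows when a new variant appears.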

Also, this coalescing option, while it makes sense to have a CONFIG
option to compile in the support for it or not, the choice to use
coalescing or not should be done at runtime. Same goes with the various
thresholds which should be runtime configurable.

There are existing mechanisms via ethtool to configure coalescing. You
should hook into these.
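struct ethtool_ops provides get_coalesce/set_coalesce hooks that receive a struct ethtool_coalesce, which includes rx_max_coalesced_frames and tx_max_coalesced_frames. Below is a minimal sketch of the validation a MAL set_coalesce handler might do. The struct is a hypothetical stand-in for the relevant subset of struct ethtool_coalesce, so the example compiles on its own. It rejects frame counts that do not fit the 9-bit hardware field instead of masking them as the current patch does.

```c
#include <errno.h>
#include <stdint.h>

#define COAL_FRAME_MASK	0x1FF	/* 9-bit frame-count field, from the patch */

/* Hypothetical subset of the fields carried by struct ethtool_coalesce. */
struct mal_coal_params {
	uint32_t rx_max_coalesced_frames;
	uint32_t tx_max_coalesced_frames;
};

/* 0 if the request fits the hardware, -EINVAL otherwise. */
static int mal_check_coalesce(const struct mal_coal_params *p)
{
	if (p->rx_max_coalesced_frames > COAL_FRAME_MASK ||
	    p->tx_max_coalesced_frames > COAL_FRAME_MASK)
		return -EINVAL;
	return 0;
}
```

Kconfig values would then only seed the defaults reported by get_coalesce.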


Cheers,
Ben.





* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-21 23:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <1253576514.7103.165.camel@pasglop>

Hi Ben,
Thanks for your comments.


> What happens if we build a kernel that is supposed to boot with two
> different variants of 405 or 440 ?

We cannot build a kernel with H/W Interrupt coalescing other than in
405EX/460EX/GT.
This is controlled via KConfig (config IBM_NEW_EMAC_INTR_COALESCE
depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX))
Is this approach acceptable (via Kconfig)?


> There are existing mechanisms via ethtool to configure coalescing. You
> should hook into these.

I will start looking at the ethtool options

Thanks
Prodyut


* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-21 23:56 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Eric Paris, Linus Torvalds, Evgeniy Polyakov, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <200909220109.05995.agruen@suse.de>

Andreas Gruenbacher wrote:
> If the antimalware vendors want to base their decisions on pathnames then 
> that's their decision, and they can check /proc/self/fd/N.

Race hazards and loopholes.  It doesn't work.

> Waiting for your code to demonstrate; an object based cache (e.g.,
> st_dev + st_ino) rather than a pathname based cache would seem more
> reasonable.

Nearly everything that people do with files involves paths.  The point
is to cache what people (or their programs) do.  Apache does not
consult inodes by number, and rsync does not write inodes by number :-)
Yes, to the code...

> > > but I see no need for access decisions on them.
> >
> > Please excuse me; I'm a bit confused.  Is fanotify intended just for
> > use by access decision programs, or is the plan now for it to also be
> > a replacement for inotify?  I'm getting conflicting signals about
> > that.
> 
> Inotify doesn't support access decisions. So where's the problem with 
> having "notify only" events for directory / mount / unmount events?

No problem here.

You seemed to be saying you want to add directory events to fanotify.
But if fanotify is only intended for access decisions?  Something I
must have misunderstood in that.

> > If it's just for access decision programs, and if those aren't going
> > to care about location, then there's no need to add directory events
> > to fanotify at all.  But then I'll be demanding subtree support in
> > inotify, please :-)
> >
> > > Even less so for mounts and unmounts.
> >
> >    (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat
> 
> ... and then someone accesses /bin/cat, which triggers a fanotify access 
> decision.

That's fine as long as there was no location-awareness in the logic
which checked foo/innocent.txt and set that inode's "read-ok,cache-me" bit.

Mount only matters if you're sensitive to location.  If you think
location-independent checks make good anti-malware
I_have_a_bridge_to_sell^H^H^H^H^H^H^H^H^H^H^Hfine with me :-)

-- Jamie


* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-22  0:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253576514.7103.165.camel@pasglop>

Hi Ben,
Thanks again for your comments.

> Same goes with the SDR register definitions. Prefix them with the SOC
> name but don't make them conditionally compiled.

I will add the base address in the Device tree, and make all register
definitions based on offset from the base in the next version of this
patch.
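The 460 register numbers in the header hunk already follow a regular pattern: six banks of four per-channel registers (ICCRTX, ICCRRX, ICCTRTX, ICCTRRX, ICTSRTX, ICTSRRX), each bank four registers after the previous one. A minimal sketch of base-plus-offset addressing under that assumption is below. The base of 0x4410 is inferred from DCRN_SDR0_ICCRTX3 = 0x4413, and the enum and function names are hypothetical. The 405EX path in the patch uses a single DCRN_SDR0_ICCRTX/ICCRRX pair with two threshold fields, so it would need its own layout.

```c
#include <stdint.h>

/* Assumed base, inferred from the DCRN_SDR0_IC* values in the header. */
#define SDR0_ICC_BASE	0x4410u

enum icc_bank {
	ICC_CR_TX  = 0x00,	/* control,         ICCRTX0..3  */
	ICC_CR_RX  = 0x04,	/* control,         ICCRRX0..3  */
	ICC_TR_TX  = 0x08,	/* count threshold, ICCTRTX0..3 */
	ICC_TR_RX  = 0x0C,	/* count threshold, ICCTRRX0..3 */
	ICC_TSR_TX = 0x10,	/* timer status,    ICTSRTX0..3 */
	ICC_TSR_RX = 0x14,	/* timer status,    ICTSRRX0..3 */
};

/* DCR number of register 'bank' for channel 'chan' (0..3). */
static uint32_t icc_reg(uint32_t base, enum icc_bank bank, unsigned int chan)
{
	return base + (uint32_t)bank + chan;
}
```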

> Also, this coalescing option, while it makes sense to have a CONFIG
> option to compile in the support for it or not, the choice to use
> coalescing or not should be done at runtime. Same goes with the various
> thresholds which should be runtime configurable.

Thanks for this comment. I will hook up ethtool to the EMAC driver, but
the MAL driver will come up with default coalesce options (as defined in
the appropriate defconfig file). The user will be able to change these
parameters as needed using ethtool.

I will get all the changes in place in the next version of this patch.

Thanks
Prodyut





* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:07 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE7FF@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 16:49 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks for your comments.
> 
> 
> > What happens if we build a kernel that is supposed to boot with two
> > different variants of 405 or 440 ?
> 
> We cannot build a kernel with H/W Interrupt coalescing other than in
> 405EX/460EX/GT.
> This is controlled via KConfig (config IBM_NEW_EMAC_INTR_COALESCE
> depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX))
> Is this approach acceptable (via Kconfig)?

No. That's my point. All of this must be runtime options. The kernel
must be buildable for 460EX -and- 460GT - and an old 440EP if I want to
in a single image, and this -with- the coalescing option enabled. It
would obviously only be available when running on the cores that support
it, but it should -not- be a compile time decision.

IE. All your ifdef's should be turned into runtime checks. If you have
conflicting #define for register names and bits, then prefix them with
the SoC name.

The only acceptable compile-time option is to have the ability to not
compile the coalescing support at all, thus avoiding bloat when building
configs that are only targeted toward processors that don't have it or
setups that don't want it. 

> > There are existing mechanisms via ethtool to configure coalescing. You
> > should hook into these.
> 
> I will start looking at the ethtool options

Thanks.

Cheers,
Ben.


* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:12 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE802@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:05 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks again for your comments.
> 
> > Same goes with the SDR register definitions. Prefix them with the SOC
> > name but don't make them conditionally compiled.
> 
> I will add the base address in the Device tree, and make all register
> definitions based on offset from the base in the next version of this
> patch.

That's a good idea. In fact, you can also use the dcr_read/write
variants of the accessors rather than the low level mfdcri/mtdcri. This
wouldn't make much of a difference unless you ever release a SoC with
those same registers behind an MMIO mapping but it's cleaner.

> Thanks for this comment. I will hook up ethtool to the EMAC driver, but
> the MAL driver will come up with default coalesce options (as defined in
> the appropriate defconfig file). The user will be able to change these
> parameters as needed using ethtool.

That's ok. I don't have an objection in using Kconfig to set the
defaults.

> I will get all the changes in place in the next version of this patch.

Thanks !

BTW. If you guys are ever going to do another change to MAL, please
please please, add the -one- major missing feature that's causing all the
pain and complication in the current design: Add a per-channel interrupt
masking option.

The lack of ability to mask the interrupt per MAL channel is what forces
us to create that fake netdev structure in order to share the napi
device instance between all the EMACs in the system. This is very
inefficient too. We would be able to make things run a lot smoother if
we could just have a napi instance per EMAC, but for that, we need
per-channel interrupt masking.

Cheers,
Ben.


* Re: fanotify as syscalls
From: Eric W. Biederman @ 2009-09-22  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Evgeniy Polyakov, Eric Paris, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <alpine.LFD.2.01.0909170934450.4950@localhost.localdomain>

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> kernel with some idiotic packet interface. It's just a fancy way to do 
> ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> packet interfaces suddenly much better?

For working with the networking stack there are a lot of advantages because
netlink is the interface to everything in the network stack.

There are nice things like the packet to create a new interface is the same
packet the kernel sends everyone to report a new interface etc.

netlink also seems to get the structured data thing right.  You can
parse the packet even if you don't understand everything.  Each tag is
well defined like a syscall, taking exactly one kind of argument.
Which avoids the worst failure of ioctl in that you can't even parse
everything, and the argument may be a linked list in the calling
process or something else atrocious.

All of that said syscalls are good, and I would not recommend netlink
to anything not in the network stack.
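The "parse the packet even if you don't understand everything" property described above comes from netlink's type-length-value attribute layout. The sketch below is a simplified, self-contained illustration, not the kernel's nla_parse(). Its header mirrors the nla_len/nla_type pair of struct nlattr, attributes are padded to 4 bytes as NLA_ALIGN() does, and the attribute types ATTR_MTU and ATTR_IFINDEX are hypothetical. Values are in host byte order. Unknown types are counted and skipped rather than treated as errors, which is what lets old parsers cope with newer senders.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified attribute header modelled on struct nlattr. */
struct tlv_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define TLV_ALIGN(n)	(((n) + 3u) & ~3u)	/* like NLA_ALIGN() */

/* Hypothetical attribute types. */
#define ATTR_MTU	1
#define ATTR_IFINDEX	2

struct link_info {
	uint32_t mtu;
	uint32_t ifindex;
	int unknown;	/* attributes skipped because their type is unknown */
};

/* Returns 0 on success, -1 if the buffer is malformed. */
static int parse_attrs(const uint8_t *buf, size_t buflen, struct link_info *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));
	while (off + sizeof(struct tlv_hdr) <= buflen) {
		struct tlv_hdr h;
		const uint8_t *payload;
		size_t plen;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return -1;	/* length field is inconsistent */
		payload = buf + off + sizeof(h);
		plen = h.len - sizeof(h);

		switch (h.type) {
		case ATTR_MTU:
			if (plen != sizeof(out->mtu))
				return -1;
			memcpy(&out->mtu, payload, plen);
			break;
		case ATTR_IFINDEX:
			if (plen != sizeof(out->ifindex))
				return -1;
			memcpy(&out->ifindex, payload, plen);
			break;
		default:
			out->unknown++;	/* skip: the length says how far */
			break;
		}
		off += TLV_ALIGN(h.len);
	}
	return 0;
}
```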

Eric


* Re: fanotify as syscalls
From: Randy Dunlap @ 2009-09-22  0:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Jamie Lokier, Evgeniy Polyakov, Eric Paris,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, alan,
	hch
In-Reply-To: <m1r5tzq1gf.fsf@fess.ebiederm.org>

On Mon, 21 Sep 2009 17:15:28 -0700 Eric W. Biederman wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> > kernel with some idiotic packet interface. It's just a fancy way to do 
> > ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> > packet interfaces suddenly much better?
> 
> For working with the networking stack there are a lot of advantages because
> netlink is the interface to everything in the network stack.
> 
> There are nice things like the packet to create a new interface is the same
> packet the kernel sends everyone to report a new interface etc.
> 
> netlink also seems to get the structured data thing right.  You can
> parse the packet even if you don't understand everything.  Each tag is
> well defined like a syscall, taking exactly one kind of argument.
> Which avoids the worst failure of ioctl in that you can't even parse
> everything, and the argument may be a linked list in the calling
> process or something else atrocious.
> 
> All of that said syscalls are good, and I would not recommend netlink
> to anything not in the network stack.

like CONFIG_SCSI_NETLINK and CONFIG_QUOTA_NETLINK_INTERFACE  :(


---
~Randy


* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: prodyut hazarika @ 2009-09-22  0:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <1253578361.7103.180.camel@pasglop>

Hi Ben,

>
> BTW. If you guys are ever going to do another change to MAL, please
> please please, add the -one- major missing feature that's causing all the
> pain and complication in the current design: Add a per-channel interrupt
> masking option.
>
> The lack of ability to mask the interrupt per MAL channel is what forces
> us to create that fake netdev structure in order to share the napi
> device instance between all the EMACs in the system. This is very
> inefficient too. We would be able to make things run a lot smoother if
> we could just have a napi instance per EMAC, but for that, we need
> per-channel interrupt masking.
>

I will add a patch for the above as soon as I am done incorporating
your comments on the MAL coalescing support.

Thanks
Prodyut


* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:39 UTC (permalink / raw)
  To: prodyut hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <49c0ff980909211728s2d39e356p6900d047c6918826@mail.gmail.com>

On Mon, 2009-09-21 at 17:28 -0700, prodyut hazarika wrote:
> > BTW. If you guys are ever going to do another change to MAL, please
> > please please, add the -one- major missing feature that's causing all the
> > pain and complication in the current design: Add a per-channel interrupt
> > masking option.
> >
> > The lack of ability to mask the interrupt per MAL channel is what forces
> > us to create that fake netdev structure in order to share the napi
> > device instance between all the EMACs in the system. This is very
> > inefficient too. We would be able to make things run a lot smoother if
> > we could just have a napi instance per EMAC, but for that, we need
> > per-channel interrupt masking.
> >
> 
> I will add a patch for the above as soon as I am done incorporating
> your comments on the MAL coalescing support.
> 
Well... the above is a HW limitation :-) IE. I was suggesting you fix
the HW, but in the case where you already did and the current MAL in
your SoC can indeed mask the interrupt per-channel, then that's great
and we should definitely look into having the driver go back to a more
standard NAPI model on MALs that have that capability.

Cheers,
Ben.


* [PATCH][RESEND] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-22  0:39 UTC (permalink / raw)
  To: netdev

This patch adds support for 6rd tunnel mode, currently targeting the
standards track at the IETF.

IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
to enable a service provider to rapidly deploy IPv6 unicast service
to IPv4 sites to which it provides customer premise equipment.  Like
6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
provider uses an IPv6 prefix of its own in place of the fixed 6to4
prefix.
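In 6rd, the customer's IPv4 address is embedded in the destination IPv6 address immediately after the provider's own prefix. Below is a minimal user-space sketch of recovering that address. It is written independently of the patch's try_6rd(), uses plain byte arrays instead of struct in6_addr, and its helper names are hypothetical. It assumes the full 32-bit IPv4 address is embedded, and it rejects prefixes longer than 96 bits because the 32 embedded bits would otherwise run past the end of the address.

```c
#include <stdint.h>
#include <string.h>

/* 1 if the first prefix_len bits of a and b are equal. */
static int prefix_equal(const uint8_t a[16], const uint8_t b[16],
			unsigned int prefix_len)
{
	unsigned int bytes = prefix_len / 8, bits = prefix_len % 8;

	if (memcmp(a, b, bytes) != 0)
		return 0;
	if (bits) {
		uint8_t mask = (uint8_t)(0xFF << (8 - bits));

		if ((a[bytes] ^ b[bytes]) & mask)
			return 0;
	}
	return 1;
}

/*
 * Return the IPv4 address (host byte order) embedded right after the
 * 6rd prefix in v6dst, or 0 if v6dst is not under the prefix.
 */
static uint32_t sixrd_extract_v4(const uint8_t prefix[16],
				 unsigned int prefix_len,
				 const uint8_t v6dst[16])
{
	uint32_t v4 = 0;
	unsigned int i;

	if (prefix_len > 96 || !prefix_equal(prefix, v6dst, prefix_len))
		return 0;
	for (i = 0; i < 32; i++) {	/* copy 32 bits, MSB first */
		unsigned int bit = prefix_len + i;

		v4 = (v4 << 1) | ((v6dst[bit / 8] >> (7 - bit % 8)) & 1u);
	}
	return v4;
}
```

For example, with the prefix 2001:db8::/32, the destination 2001:db8:c000:201:: carries 192.0.2.1 (0xC0000201). The bit-by-bit loop trades speed for handling prefix lengths that are not multiples of 32 without reading beyond the 16-byte address.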

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 include/linux/if_tunnel.h |   10 +++++
 include/net/ipip.h        |    2 +
 net/ipv6/Kconfig          |   13 +++++++
 net/ipv6/sit.c            |   84 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 5eb9b0f..0d44376 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -15,6 +15,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -51,6 +55,12 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+/* 6RD parms */
+struct ip_tunnel_6rd {
+	struct in6_addr		addr;
+	__u8			prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/include/net/ipip.h b/include/net/ipip.h
index 5d3036f..fa92c41 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -26,6 +26,8 @@ struct ip_tunnel
 
 	struct ip_tunnel_prl_entry	*prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
+
+	struct ip_tunnel_6rd	ip6rd_prefix;	/* 6RD SP prefix */
 };
 
 /* ISATAP: default interval between RS in secondy */
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ead6c7a..78a565b 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -170,6 +170,19 @@ config IPV6_SIT
 
 	  Saying M here will produce a module called sit. If unsure, say Y.
 
+config IPV6_SIT_6RD
+	bool "IPv6: 6rd tunnel mode (EXPERIMENTAL)"
+	depends on IPV6_SIT && EXPERIMENTAL
+	default n
+	---help---
+	IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
+	to enable a service provider to rapidly deploy IPv6 unicast service
+	to IPv4 sites to which it provides customer premise equipment.  Like
+	6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
+	transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
+	provider uses an IPv6 prefix of its own in place of the fixed 6to4
+	prefix.
+
 config IPV6_NDISC_NODETYPE
 	bool
 
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0ae4f64..ff62e97 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -604,6 +604,30 @@ static inline __be32 try_6to4(struct in6_addr *v6dst)
 	return dst;
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* Returns the embedded IPv4 address if the IPv6 address comes from
+   6rd rule */
+
+static inline __be32 try_6rd(struct in6_addr *addr, u8 prefix_len, struct in6_addr *v6dst)
+{
+	__be32 dst = 0;
+
+	/* isolate addr according to mask */
+	if (ipv6_prefix_equal(v6dst, addr, prefix_len)) {
+		unsigned int d32_off, bits;
+
+		d32_off = prefix_len >> 5;
+		bits = (prefix_len & 0x1f);
+
+		dst = (ntohl(v6dst->s6_addr32[d32_off]) << bits);
+		if (bits)
+			dst |= ntohl(v6dst->s6_addr32[d32_off + 1]) >> (32 - bits);
+		dst = htonl(dst);
+	}
+	return dst;
+}
+#endif
+
 /*
  *	This function assumes it is being called from dev_queue_xmit()
  *	and that skb is filled properly by that function.
@@ -657,6 +681,13 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (!dst && tunnel->ip6rd_prefix.prefixlen)
+		dst = try_6rd(&tunnel->ip6rd_prefix.addr,
+			      tunnel->ip6rd_prefix.prefixlen,
+			      &iph6->daddr);
+	else
+#endif
 	if (!dst)
 		dst = try_6to4(&iph6->daddr);
 
@@ -848,6 +879,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	int err = 0;
 	struct ip_tunnel_parm p;
 	struct ip_tunnel_prl prl;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 	struct ip_tunnel *t;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
@@ -987,6 +1021,56 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		netdev_state_change(dev);
 		break;
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	case SIOCGET6RD:
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+		memcpy(&ip6rd, &t->ip6rd_prefix, sizeof(ip6rd));
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &ip6rd, sizeof(ip6rd)))
+			err = -EFAULT;
+		else
+			err = 0;
+		break;
+
+	case SIOCADD6RD:
+	case SIOCDEL6RD:
+	case SIOCCHG6RD:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -EFAULT;
+		if (copy_from_user(&ip6rd, ifr->ifr_ifru.ifru_data, sizeof(ip6rd)))
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+
+		err = 0;
+		switch (cmd) {
+		case SIOCDEL6RD:
+			memset(&t->ip6rd_prefix, 0, sizeof(ip6rd));
+			break;
+		case SIOCADD6RD:
+		case SIOCCHG6RD:
+			if (ip6rd.prefixlen >= 95) {
+				err = -EINVAL;
+				goto done;
+			}
+			t->ip6rd_prefix.addr = ip6rd.addr;
+			t->ip6rd_prefix.prefixlen = ip6rd.prefixlen;
+			break;
+		}
+		netdev_state_change(dev);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 	}
-- 
1.6.0.4
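
The bit arithmetic in `try_6rd()` above can be tried outside the kernel. Below is a minimal userspace sketch, assuming glibc's `<arpa/inet.h>`. `sketch_prefix_equal()` and `sketch_try_6rd()` are hypothetical names that stand in for the kernel's `ipv6_prefix_equal()` and the patch's `try_6rd()`. The sketch views the destination as four 32-bit words and takes the 32 bits that start right after the 6rd prefix. When the prefix length is not a multiple of 32, it joins two adjacent words. Like the patch, it assumes the ioctl handler has already limited the prefix length to less than 95 bits.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's ipv6_prefix_equal():
 * returns nonzero if the first prefix_len bits of a and b match. */
static int sketch_prefix_equal(const struct in6_addr *a,
			       const struct in6_addr *b,
			       unsigned int prefix_len)
{
	unsigned int bytes = prefix_len / 8;
	unsigned int rem = prefix_len % 8;

	if (memcmp(a->s6_addr, b->s6_addr, bytes) != 0)
		return 0;
	if (rem) {
		/* Compare only the top 'rem' bits of the partial byte. */
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		if ((a->s6_addr[bytes] ^ b->s6_addr[bytes]) & mask)
			return 0;
	}
	return 1;
}

/* Userspace sketch of try_6rd(): returns the embedded IPv4 address in
 * network byte order, or 0 if v6dst is not covered by the 6rd prefix.
 * Assumes prefix_len < 95, as the patch's ioctl handler enforces. */
static uint32_t sketch_try_6rd(const struct in6_addr *prefix,
			       unsigned int prefix_len,
			       const struct in6_addr *v6dst)
{
	uint32_t words[4];
	unsigned int off, bits;
	uint32_t dst;

	if (!sketch_prefix_equal(v6dst, prefix, prefix_len))
		return 0;

	/* View the destination as four big-endian 32-bit words. */
	memcpy(words, v6dst->s6_addr, sizeof(words));
	off = prefix_len >> 5;		/* word holding the first IPv4 bit */
	bits = prefix_len & 0x1f;	/* prefix bits inside that word */

	dst = ntohl(words[off]) << bits;
	if (bits)	/* the low bits of the IPv4 address come from the next word */
		dst |= ntohl(words[off + 1]) >> (32 - bits);
	return htonl(dst);
}
```

With a 6rd prefix of `2001:db8::/32`, the destination `2001:db8:c000:201::1` yields 192.0.2.1. With `2001:db8:ab00::/40`, the address is split across words, and `2001:db8:abc0:2:100::1` also yields 192.0.2.1. A destination outside the prefix yields 0. Note that, as written, the patched `ipip6_tunnel_xmit()` calls `try_6to4()` only when no 6rd prefix is configured. This follows from the `else` that precedes `#endif`.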



* [PATCH iproute2][RESEND] IPv6: 6rd iproute2 support
From: Alexandre Cassen @ 2009-09-22  0:41 UTC (permalink / raw)
  To: netdev

This patch provides iproute2 facilities to configure a 6rd tunnel. To
configure one, create a sit tunnel and set its 6rd prefix as follows:

    ip tunnel add sit1 mode sit local a.b.c.d ttl 64
    ip tunnel 6rd dev sit1 set-6rd_prefix xxxx:yyyy::/z

Additionally, you can reset the 6rd prefix:

    ip tunnel 6rd dev sit1 reset-6rd_prefix

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 include/linux/if_tunnel.h |   10 ++++++++
 ip/iptunnel.c             |   53 ++++++++++++++++++++++++++++++++++++++++++++-
 ip/tunnel.c               |   17 +++++++++++++-
 ip/tunnel.h               |    2 +
 4 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 9229075..5ebe5a4 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -12,6 +12,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -48,6 +52,12 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+/* 6RD parms */
+struct ip_tunnel_6rd {
+	struct in6_addr		addr;
+	__u8			prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 338d8bd..31843ad 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -38,10 +38,11 @@ static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: ip tunnel { add | change | del | show | prl } [ NAME ]\n");
+	fprintf(stderr, "Usage: ip tunnel { add | change | del | show | prl | 6rd } [ NAME ]\n");
 	fprintf(stderr, "          [ mode { ipip | gre | sit | isatap } ] [ remote ADDR ] [ local ADDR ]\n");
 	fprintf(stderr, "          [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]\n");
 	fprintf(stderr, "          [ prl-default ADDR ] [ prl-nodefault ADDR ] [ prl-delete ADDR ]\n");
+	fprintf(stderr, "          [ set-6rd_prefix ADDR ] [ reset-6rd_prefix ]\n");
 	fprintf(stderr, "          [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: NAME := STRING\n");
@@ -308,11 +309,13 @@ static int do_del(int argc, char **argv)
 
 static void print_tunnel(struct ip_tunnel_parm *p)
 {
+	struct ip_tunnel_6rd ip6rd;
 	char s1[1024];
 	char s2[1024];
 	char s3[64];
 	char s4[64];
 
+	memset(&ip6rd, 0, sizeof(ip6rd));
 	inet_ntop(AF_INET, &p->i_key, s3, sizeof(s3));
 	inet_ntop(AF_INET, &p->o_key, s4, sizeof(s4));
 
@@ -368,6 +371,13 @@ static void print_tunnel(struct ip_tunnel_parm *p)
 	if (!(p->iph.frag_off&htons(IP_DF)))
 		printf(" nopmtudisc");
 
+	if (!tnl_ioctl_get_6rd(p->name, &ip6rd) && ip6rd.prefixlen) {
+		char buf[128];
+		printf(" 6rd_prefix %s/%u ",
+		       inet_ntop(AF_INET6, &ip6rd.addr, buf, 128),
+		       ip6rd.prefixlen);
+	}
+
 	if ((p->i_flags&GRE_KEY) && (p->o_flags&GRE_KEY) && p->o_key == p->i_key)
 		printf(" key %s", s3);
 	else if ((p->i_flags|p->o_flags)&GRE_KEY) {
@@ -534,6 +544,45 @@ static int do_prl(int argc, char **argv)
 	return tnl_prl_ioctl(cmd, medium, &p);
 }
 
+static int do_6rd(int argc, char **argv)
+{
+	struct ip_tunnel_6rd ip6rd;
+	int devname = 0;
+	int cmd = 0;
+	char medium[IFNAMSIZ];
+
+	memset(&ip6rd, 0, sizeof(ip6rd));
+	memset(&medium, 0, sizeof(medium));
+
+	while (argc > 0) {
+		if (strcmp(*argv, "set-6rd_prefix") == 0) {
+			inet_prefix prefix;
+			NEXT_ARG();
+			if (get_prefix(&prefix, *argv, AF_INET6))
+				invarg("invalid 6rd_prefix\n", *argv);
+			cmd = SIOCADD6RD;
+			memcpy(&ip6rd.addr, prefix.data, 16);
+			ip6rd.prefixlen = prefix.bitlen;
+		} else if (strcmp(*argv, "reset-6rd_prefix") == 0) {
+			cmd = SIOCDEL6RD;
+		} else if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			strncpy(medium, *argv, IFNAMSIZ-1);
+			devname++;
+		} else {
+			fprintf(stderr,"%s: Invalid 6RD parameter.\n", *argv);
+			exit(-1);
+		}
+		argc--; argv++;
+	}
+	if (devname == 0) {
+		fprintf(stderr, "Must specify dev.\n");
+		exit(-1);
+	}
+
+	return tnl_6rd_ioctl(cmd, medium, &ip6rd);
+}
+
 int do_iptunnel(int argc, char **argv)
 {
 	switch (preferred_family) {
@@ -567,6 +616,8 @@ int do_iptunnel(int argc, char **argv)
 			return do_show(argc-1, argv+1);
 		if (matches(*argv, "prl") == 0)
 			return do_prl(argc-1, argv+1);
+		if (matches(*argv, "6rd") == 0)
+			return do_6rd(argc-1, argv+1);
 		if (matches(*argv, "help") == 0)
 			usage();
 	} else
diff --git a/ip/tunnel.c b/ip/tunnel.c
index d1296e6..d389e86 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -168,7 +168,7 @@ int tnl_del_ioctl(const char *basedev, const char *name, void *p)
 	return err;
 }
 
-int tnl_prl_ioctl(int cmd, const char *name, void *p)
+static int tnl_gen_ioctl(int cmd, const char *name, void *p)
 {
 	struct ifreq ifr;
 	int fd;
@@ -183,3 +183,18 @@ int tnl_prl_ioctl(int cmd, const char *name, void *p)
 	close(fd);
 	return err;
 }
+
+int tnl_prl_ioctl(int cmd, const char *name, void *p)
+{
+	return tnl_gen_ioctl(cmd, name, p);
+}
+
+int tnl_6rd_ioctl(int cmd, const char *name, void *p)
+{
+	return tnl_gen_ioctl(cmd, name, p);
+}
+
+int tnl_ioctl_get_6rd(const char *name, void *p)
+{
+	return tnl_gen_ioctl(SIOCGET6RD, name, p);
+}
diff --git a/ip/tunnel.h b/ip/tunnel.h
index 0661e27..ded226b 100644
--- a/ip/tunnel.h
+++ b/ip/tunnel.h
@@ -32,5 +32,7 @@ int tnl_get_ioctl(const char *basedev, void *p);
 int tnl_add_ioctl(int cmd, const char *basedev, const char *name, void *p);
 int tnl_del_ioctl(const char *basedev, const char *name, void *p);
 int tnl_prl_ioctl(int cmd, const char *name, void *p);
+int tnl_6rd_ioctl(int cmd, const char *name, void *p);
+int tnl_ioctl_get_6rd(const char *name, void *p);
 
 #endif
-- 
1.6.0.4
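
The `set-6rd_prefix` argument is an `ADDR/LEN` string. The sketch below shows one way to parse and validate it before issuing `SIOCADD6RD`, and one way to format it back for display. It uses plain `inet_pton()`/`inet_ntop()` rather than iproute2's `get_prefix()`. `struct sketch_ip_tunnel_6rd`, `sketch_parse_6rd_prefix()` and `sketch_format_6rd()` are hypothetical names. The struct mirrors the patch's `struct ip_tunnel_6rd`, so the code builds without the patched header. The length check copies the kernel patch's `prefixlen >= 95` rejection, so a bad length is caught in userspace instead of surfacing as `EINVAL` from the ioctl.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical local mirror of the patch's struct ip_tunnel_6rd. */
struct sketch_ip_tunnel_6rd {
	struct in6_addr addr;
	unsigned char prefixlen;
};

/* Hypothetical helper: parse "ADDR/LEN" into p.
 * Returns 0 on success, -EINVAL on malformed input. */
static int sketch_parse_6rd_prefix(const char *arg,
				   struct sketch_ip_tunnel_6rd *p)
{
	char buf[INET6_ADDRSTRLEN + 4];	/* room for "/LEN" */
	char *slash, *end;
	unsigned long len;

	if (strlen(arg) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, arg);
	slash = strchr(buf, '/');
	if (!slash || slash[1] == '\0')
		return -EINVAL;
	*slash = '\0';

	errno = 0;
	len = strtoul(slash + 1, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;
	/* Same bound the kernel patch applies for SIOCADD6RD/SIOCCHG6RD. */
	if (len >= 95)
		return -EINVAL;

	memset(p, 0, sizeof(*p));
	if (inet_pton(AF_INET6, buf, &p->addr) != 1)
		return -EINVAL;
	p->prefixlen = (unsigned char)len;
	return 0;
}

/* Hypothetical helper: render p as "ADDR/LEN" into out. */
static const char *sketch_format_6rd(const struct sketch_ip_tunnel_6rd *p,
				     char *out, size_t outlen)
{
	char addr[INET6_ADDRSTRLEN];

	if (!inet_ntop(AF_INET6, &p->addr, addr, sizeof(addr)))
		return NULL;
	snprintf(out, outlen, "%s/%u", addr, (unsigned int)p->prefixlen);
	return out;
}
```

`sketch_format_6rd()` produces the same `ADDR/LEN` form that the patched `print_tunnel()` prints after `6rd_prefix`.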



* [PATCH] fec: Add FEC support for MX25 processor
From: Fabio Estevam @ 2009-09-22  0:41 UTC (permalink / raw)
  To: netdev; +Cc: s.hauer

Add FEC support for MX25 processor.

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
---
 drivers/net/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index ed5741b..2bea67c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1875,7 +1875,7 @@ config 68360_ENET
 
 config FEC
 	bool "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
-	depends on M523x || M527x || M5272 || M528x || M520x || M532x || MACH_MX27 || ARCH_MX35
+	depends on M523x || M527x || M5272 || M528x || M520x || M532x || MACH_MX27 || ARCH_MX35 || ARCH_MX25
 	help
 	  Say Y here if you want to use the built-in 10/100 Fast ethernet
 	  controller on some Motorola ColdFire and Freescale i.MX processors.
-- 
1.6.0.4


      


* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-22  0:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, prodyut hazarika
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253579943.7103.194.camel@pasglop>

Hi Ben,

> Well... the above is a HW limitation :-) IE. I was suggesting you fix
> the HW, but in the case where you already did and the current MAL in
> your SoC can indeed mask the interrupt per-channel, then that's great
> and we should definitely look into having the driver go back to a more
> standard NAPI model on MALs that have that capability.

In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
both on Tx and Rx per channel (physical not virtual), which can be
enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
defined in the dts file. But in the older revs, there is only a global
EOP_Int_Enable in the MAL configuration register. There can be a
possible way even for older SoCs if we use the MAL descriptor I bit and
not the global EOP_Int_Enable. But to turn on/off the channel, we will
have to go and set/clear the I bit in whole of MAL descriptor ring for
that channel. That might be really inefficient.

What would you suggest?

Thanks
Prodyut
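
The following makes the cost of the per-descriptor alternative concrete. With only the I bit available, enabling or disabling interrupts for one channel means visiting every descriptor in that channel's ring. By contrast, the global EOP_Int_Enable is a single register bit. This sketch is illustrative only. `struct sketch_desc`, `SKETCH_DESC_INTR` and `sketch_set_ring_intr()` are hypothetical and do not reflect the real MAL descriptor layout or bit positions. A real driver would also have to handle descriptors currently owned by the hardware and issue memory barriers, which this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout and interrupt flag, NOT the real MAL ones. */
#define SKETCH_DESC_INTR 0x0001u

struct sketch_desc {
	uint16_t ctrl;		/* control/status bits, including the I bit */
	uint16_t data_len;
	uint32_t data_ptr;
};

/* Set or clear the per-descriptor interrupt bit on every descriptor of
 * one channel's ring. Returns the number of descriptors touched, which
 * grows linearly with the ring size. */
static size_t sketch_set_ring_intr(struct sketch_desc *ring, size_t n,
				   int enable)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (enable)
			ring[i].ctrl |= SKETCH_DESC_INTR;
		else
			ring[i].ctrl &= (uint16_t)~SKETCH_DESC_INTR;
	}
	return n;
}
```

For a 256-entry ring, every enable/disable toggle would rewrite 256 descriptors. This is the inefficiency described above.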




* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  1:09 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	linuxppc-dev, bhutchings, prodyut hazarika, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE81A@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:53 -0700, Prodyut Hazarika wrote:
> 
> In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
> both on Tx and Rx per channel (physical not virtual), which can be
> enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
> defined in the dts file. But in the older revs, there is only a global
> EOP_Int_Enable in the MAL configuration register. There can be a
> possible way even for older SoCs if we use the MAL descriptor I bit and
> not the global EOP_Int_Enable. But to turn on/off the channel, we will
> have to go and set/clear the I bit in whole of MAL descriptor ring for
> that channel. That might be really inefficient.
> 
> What would you suggest?

I wouldn't bother with the old SoCs; we should keep the current
workaround we have today for them. For the new ones, I'll have a look
and see how we can get the driver upgraded to avoid the workaround.

Don't bother with this for now. I'll dig at some stage.

Cheers,
Ben.


* [PATCH] cnic: Shutdown iSCSI ring during uio_close.
From: Michael Chan @ 2009-09-22  1:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, michaelc, Michael Chan, Benjamin Li

The iSCSI ring should be shut down during uio_close instead of uio_open
for proper operation.  This fixes the problem of the ring getting
stuck intermittently.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Benjamin Li <benli@broadcom.com>
---
 drivers/net/cnic.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index d45eacb..211c8e9 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -85,8 +85,6 @@ static int cnic_uio_open(struct uio_info *uinfo, struct inode *inode)
 
 	cp->uio_dev = iminor(inode);
 
-	cnic_shutdown_bnx2_rx_ring(dev);
-
 	cnic_init_bnx2_tx_ring(dev);
 	cnic_init_bnx2_rx_ring(dev);
 
@@ -98,6 +96,8 @@ static int cnic_uio_close(struct uio_info *uinfo, struct inode *inode)
 	struct cnic_dev *dev = uinfo->priv;
 	struct cnic_local *cp = dev->cnic_priv;
 
+	cnic_shutdown_bnx2_rx_ring(dev);
+
 	cp->uio_dev = -1;
 	return 0;
 }
-- 
1.6.4.GIT




* Re: [PATCH][RESEND] IPv6: 6rd tunnel mode
From: Brian Haley @ 2009-09-22  2:39 UTC (permalink / raw)
  To: Alexandre Cassen; +Cc: netdev
In-Reply-To: <20090922003956.GA19947@lnxos.staff.proxad.net>

Hi Alexandre,

Alexandre Cassen wrote:
> This patch add support to 6rd tunnel mode currently targetting
> standard track at the IETF.
> 
> IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
> to enable a service provider to rapidly deploy IPv6 unicast service
> to IPv4 sites to which it provides customer premise equipment.  Like
> 6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
> transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
> provider uses an IPv6 prefix of its own in place of the fixed 6to4
> prefix.

I couldn't find RFC 5569 (delayed due to IPR rights?), although I did find
the latest 6rd draft, -03.  It was listed as Informational, not Standards
Track; is that right?  Just curious.

> +		case SIOCADD6RD:
> +		case SIOCCHG6RD:
> +			if (ip6rd.prefixlen >= 95) {
> +				err = -EINVAL;
> +				goto done;
> +			}
> +			t->ip6rd_prefix.addr = ip6rd.addr;

ipv6_addr_copy(&t->ip6rd_prefix.addr, &ip6rd.addr); is the preferred way to
copy the address.

-Brian


