Netdev List

Netdev List
 help / color / mirror / Atom feed

* IPV6_PKTINFO socket option
From: Gerrit Renker @ 2007-08-08 11:16 UTC (permalink / raw)
  To: yoshfuji; +Cc: netdev

Yoshifuji-san,

a few weeks earlier I enquired about the IPV6_PKTINFO socket option to get at the
destination address of datagrams, where you replied that this option is `deprecated'.

There are three problems:
 
 1. On i386 it works as described in section 4 of RFC 3542, using IPV6_RECVPKTINFO
    as sticky socket option to pull out the IPV6_PKTINFO cmsg header fields.
   
 2. On sparc64 with the same kernel IPV6_PKTINFO works without problems, even pulls
    out the cmsg fields correctly. Conversely, when trying to set the IPV6_RECVPKTINFO
    sticky option on the socket, no cmsg fields are generated. 
    The kernel is of the same date and revision as the i386 kernel - library issue ???
    It is very annoying, since the application needs to run on both architectures.

 3. Misc:
    * The option is mentioned 9 times in the index of "Unix Network Programming"
      (3rd ed., p. 969). Moreover, an entire section (27.7) is devoted to this topic.
    * It might be good to give at least a warning message in the syslog that the
      IPV6_PKTINFO socket option is no longer supported. That would save many users
      grief. An example how this was done in DCCP is in net/dccp/proto.c:

        case DCCP_SOCKOPT_PACKET_SIZE:
                DCCP_WARN("sockopt(PACKET_SIZE) is deprecated: fix your app");
   

Maybe this is due to a misunderstanding - in which case I'd be grateful for any clarifications.

Thanks
Gerrit

^ permalink raw reply

* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
From: Krishna Kumar2 @ 2007-08-08 11:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, sri, davem, rdreier
In-Reply-To: <20070808065946.6c80c470@oldman>

Stephen Hemminger <shemminger@linux-foundation.org> wrote on 08/08/2007
04:29:46 PM:

> How do qdisc's that do rate control work now? Did you add logic to
> breakup batches for token bucket, etc.

My thought was that the code works identically to the current code except
in requeue cases. For requeue the existing code sched decides which skb
to pass next (first) to device; while in batching case it is assumed that
skbs already in batch list are part of the hardware queue and sent first.

If that is a problem - at the time of enabling batching it is possible
to check what qdisc_ops is being used (or whenever the qdisc's change),
or maybe add a flag to each qdisc_ops to indicate whether qdisc supports
batching, etc ? Like definitely enable batching if using pfifo_fast_ops..

What is your suggestion ?

thanks,

- KK

^ permalink raw reply

* Re: 2.6.20->2.6.21 - networking dies after random time
From: Jarek Poplawski @ 2007-08-08 11:42 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox
In-Reply-To: <4bacf17f0708080409t116b5c84ye60dff7da51d0fdf@mail.gmail.com>

Read below please:

On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > So, the let's try this idea yet: modified Ingo's "x86: activate
> > HARDIRQS_SW_RESEND" patch.
> > (Don't forget about make oldconfig before make.)
> > For testing only.
> >
> > Cheers,
> > Jarek P.
> >
> > PS: alas there was not even time for "compile checking"...
> >
> > ---
> >
> > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  config X86_SMP
> >         bool
> >         depends on SMP && !X86_VOYAGER
> > diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> > --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> > @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  menu "Power management options"
> >
> >  source kernel/power/Kconfig
> > diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> > --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> > @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
> >                 desc->depth--;
> >         }
> >         spin_unlock_irqrestore(&desc->lock, flags);
> > +#ifdef CONFIG_HARDIRQS_SW_RESEND
> > +       /*
> > +        * Do a bh disable/enable pair to trigger any pending
> > +        * irq resend logic:
> > +        */
> > +       local_bh_disable();
> > +       local_bh_enable();
> > +#endif
> >  }
> >  EXPORT_SYMBOL(enable_irq);
> >
> > diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> > --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> > @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
> >          */
> >         desc->chip->enable(irq);
> >
> > +       /*
> > +        * Temporary hack to figure out more about the problem, which
> > +        * is causing the ancient network cards to die.
> > +        */
> > +
> >         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
> >                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
> >
> > -               if (!desc->chip || !desc->chip->retrigger ||
> > -                                       !desc->chip->retrigger(irq)) {
> > +               if (desc->handle_irq == handle_edge_irq) {
> > +                       if (desc->chip->retrigger)
> > +                               desc->chip->retrigger(irq);
> > +                       return;
> > +               }
> >  #ifdef CONFIG_HARDIRQS_SW_RESEND
> > -                       /* Set it pending and activate the softirq: */
> > -                       set_bit(irq, irqs_resend);
> > -                       tasklet_schedule(&resend_tasklet);
> > +               WARN_ON_ONCE(1);
> > +               /* Set it pending and activate the softirq: */
> > +               set_bit(irq, irqs_resend);
> > +               tasklet_schedule(&resend_tasklet);
> >  #endif
> > -               }
> >         }
> >  }
> >
> Works fine with:

Very nice! It would be about time this kernel should start behave...

> WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> 
> Call Trace:
>  [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
>  [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
>  [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
>  [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
>  [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
>  [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
>  [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
>  [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
>  [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
>  [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
>  [<ffffffff8054f5c0>] dst_output+0x0/0x10
>  [<ffffffff80552aff>] ip_output+0x19f/0x340
>  [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
>  [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
>  [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
>  [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
>  [<ffffffff80577735>] inet_sendmsg+0x45/0x80
>  [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
>  [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
>  [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
>  [<ffffffff802868e9>] vfs_write+0x159/0x170
>  [<ffffffff80286ef0>] sys_write+0x50/0x90
>  [<ffffffff802097fe>] system_call+0x7e/0x83
> 

So, it looks like x86_64 io_apic's IPI code was unused too long...
I hope it's a piece of cake for Ingo now...

Thanks very much Marcin!

If it's possible for you and Jean-Baptiste, try this today patch
with -rc2, and maybe once more this one patch (-rc1 or older).

Regards,
Jarek P. 


^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Andi Kleen @ 2007-08-08 12:38 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <46B918BA.2020400@garzik.org>

Jeff Garzik <jeff@garzik.org> writes:
> > +			val, reg_index, addr, addr+4); */
> > +	writel(cpu_to_le32(reg_index), addr);
> > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> 
> wrong -- endian conversion macros not needed with writel()

Are you sure? I don't think that's true. e.g. powerpc writel 
doesn't convert endian

-Andi


^ permalink raw reply

* Re: [PATCH 3/14] nes: connection manager routines
From: Andi Kleen @ 2007-08-08 12:41 UTC (permalink / raw)
  To: ggrundstrom; +Cc: rdreier, ewg, netdev
In-Reply-To: <200708080050.l780oGxo004694@neteffect.com>

ggrundstrom@neteffect.com writes:

> NetEffect connection manager routines.

This seems more like a new TCP stack. I don't think such code
is appropiate, since Linux already has one.

-Andi

^ permalink raw reply

* [PATCH 1/1] af_packet: don't enable timestamps in mmap'ed sockets
From: Unai Uribarri @ 2007-08-08 11:50 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hello folks,

I've discovered two strange behaviours (bugs?) about timestamp
generation:

1. If a program opens an AF_PACKET socket and setup a reception ring
with setsockopt(sock, SOL_PACKET, PACKET_RX_RING), timestamps are
automatically (re)enabled at the reception of every packet.

2. Setting SOL_SOCKET/SO_TIMESTAMP to 0 doesn't disables timestamp
generation. Every skb continues begin timestamped until you close the
socket that activated it.

Timestamp generation is a heavy task that is consuming more than 50% of
the CPU (using ACPI PM clock) and is currently the bottleneck in my
packet capturing application.

The attached patch removes the automatic timestamp activation, that
only mmap'ed AF_PACKET sockets perform. I known it can break user
applications, but I believe that it's the correct solution.

I will be very pleased to receive any feedback.

Signed-off-by: Unai Uribarri <unai.uribarri@optenet.com>

---

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1322d62..a4f2da3 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -640,10 +640,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
        h->tp_snaplen = snaplen;
        h->tp_mac = macoff;
        h->tp_net = netoff;
-       if (skb->tstamp.tv64 == 0) {
-               __net_timestamp(skb);
-               sock_enable_timestamp(sk);
-       }
        tv = ktime_to_timeval(skb->tstamp);
        h->tp_sec = tv.tv_sec;
        h->tp_usec = tv.tv_usec;

^ permalink raw reply related

* Re: [PATCH 2/14] nes: device structures and defines
From: Michael Buesch @ 2007-08-08 11:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <p73hcnarykt.fsf@bingen.suse.de>

On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> Jeff Garzik <jeff@garzik.org> writes:
> > > +			val, reg_index, addr, addr+4); */
> > > +	writel(cpu_to_le32(reg_index), addr);
> > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > 
> > wrong -- endian conversion macros not needed with writel()
> 
> Are you sure? I don't think that's true. e.g. powerpc writel 
> doesn't convert endian

Andi, you are wrong.
writeX take CPU endian args.

-- 
Greetings Michael.

^ permalink raw reply

* Re: 2.6.20->2.6.21 - networking dies after random time
From: Jarek Poplawski @ 2007-08-08 11:53 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox
In-Reply-To: <20070808114243.GC2426@ff.dom.local>

On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
...
> So, it looks like x86_64 io_apic's IPI code was unused too long...

To be fair it's x86_64 lapic's IPI code.

Jarek P.

^ permalink raw reply

* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
From: Evgeniy Polyakov @ 2007-08-08 12:01 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, kaber, jeff, hadi, general, mchan,
	tgraf, netdev, shemminger, davem, sri
In-Reply-To: <20070808093135.15396.29687.sendpatchset@localhost.localdomain>

Hi Krishna.

On Wed, Aug 08, 2007 at 03:01:35PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
> +{
> +	int ret = 0;
> +	struct sk_buff_head *blist;
> +
> +	if (!dev->hard_start_xmit_batch) {
> +		/* Driver doesn't support batching skb API */
> +		ret = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	/* Handle invalid argument */
> +	if (new_batch_skb < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

It is unsigned, how can it be less than zero?

And actually you use it just like a binary flag (casted to/from u32 in
the code, btw), so why not using ethtool_value directly here?

> +	/* Check if new value is same as the current */
> +	if (!!dev->skb_blist == !!new_batch_skb)
> +		goto out;
> +
> +	if (new_batch_skb &&
> +	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	spin_lock(&dev->queue_lock);
> +	if (new_batch_skb) {
> +		skb_queue_head_init(blist);
> +		dev->skb_blist = blist;
> +	} else
> +		free_batching(dev);
> +	spin_unlock(&dev->queue_lock);

This needs bh lock too, since blist is accessed by qdisc_restart.

> +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> +{
> +	if (!list_empty(&ptype_all))
> +		dev_queue_xmit_nit(skb, dev);
> +
> +	if (netif_needs_gso(dev, skb)) {
> +		if (unlikely(dev_gso_segment(skb))) {
> +			kfree(skb);
> +			return 0;
> +		}
> +
> +		if (skb->next) {
> +			int count = 0;
> +
> +			do {
> +				struct sk_buff *nskb = skb->next;
> +
> +				skb->next = nskb->next;
> +				__skb_queue_tail(dev->skb_blist, nskb);
> +				count++;
> +			} while (skb->next);

Is it possible to move list without iterating over each entry?


-- 
	Evgeniy Polyakov

^ permalink raw reply

* [DOC] Net tx batching driver howto
From: jamal @ 2007-08-08 12:02 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Jeff Garzik, Gagan Arneja,
	Leonid Grossman, Sridhar Samudrala, Rick Jones, Robert Olsson,
	David Miller, shemminger, kaber
In-Reply-To: <1181603345.4071.67.camel@localhost>

[-- Attachment #1: Type: text/plain, Size: 163 bytes --]

This is a small update to the howto.

The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

cheers,
jamal

[-- Attachment #2: batch-driver-howto.txt --]
[-- Type: text/plain, Size: 5671 bytes --]

Heres the begining of a howto for driver author.
The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The intended audience for this howto is people already
familiar with netdevices.

0) Hardware Pre-requisites:
---------------------------

You must have at least hardware that is capable of doing
DMA with many descriptors; i.e having hardware with a queue
length of 3 (as in some fscked ethernet hardware) is not
very useful in this case.

1) What is new in the driver API:
---------------------------------

a) A new method called onto the driver by the net tx core to
batch packets. This method, dev->hard_batch_xmit(dev), 
is no different than dev->hard_start_xmit(dev) in terms of the 
arguements it takes. You just have to handle it differently 
(more below).

b) A new method, dev->hard_prep_xmit(), called onto the driver to 
massage the packet before it gets transmitted. 
This method is optional i.e if you dont specify it, you will
not be invoked(more below)

c) A new variable dev->xmit_win which provides suggestions to the
core calling into the driver a rough estimate of how many
packets can be batched onto the driver. 

2) Driver pre-requisite
------------------------

The typical driver tx state machine is:

----
--> +Core sends packets
    +--> Driver puts packet onto hardware queue
    +    if hardware queue is full, netif_stop_queue(dev)
    +
--> +core stops sending because of netif_stop_queue(dev)
..
.. time passes
..
..
--> +---> driver has transmitted packets, opens up tx path by
          invoking netif_wake_queue(dev)
--> +Core sends packets, and the cycle repeats.
----

The pre-requisite for batching changes is that the driver should 
provide a low threshold to open up the tx path.
This is a very important requirement in making batching useful.
Drivers such as tg3 and e1000 already do this.
So in the above annotation, as a driver author, before you
invoke netif_wake_queue(dev) you check if there are enough
entries left.

Heres an example of how i added it to tun driver
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Heres how the batching e1000 driver does it (ignore the setting of
netdev->xmit_win, more on this later):

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
	       int rspace =  E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS +  2);
	       netdev->xmit_win = rspace;
	       netif_wake_queue(netdev);
       }
---

in tg3 code looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
		(tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
			netif_wake_queue(tp->dev);
---

3) Driver Setup:
-------------------

a) On initialization (before netdev registration)
 i) set NETIF_F_BTX in  dev->features 
  i.e dev->features |= NETIF_F_BTX
  This makes the core do proper initialization.

  ii) set dev->xmit_win to something reasonable like
  maybe half the tx DMA ring size etc.
  This is later used by the core to guess how much packets to send
  in one batch. 

  b) create proper pointer to the two new methods desribed above.

4) The new methods
--------------------

  a) The batching method

Heres an example of a batch tx routine that is similar
to the one i added to tun driver

----
  static int xxx_net_bxmit(struct net_device *dev)
  {
  ....
  ....
        while (skb_queue_len(dev->blist)) {
	        dequeue from dev->blist
		enqueue onto hardware ring
		if hardware ring full break
        }

	if (hardware ring full) {
		  netif_stop_queue(dev);
		  dev->xmit_win = 1;
	}

       if we queued on hardware, tell it to chew
       .......
       ..
       .
  }
------

All return codes like NETDEV_TX_OK etc still apply.
In this method, if there are any IO operations that apply to a 
set of packets (such as kicking DMA) leave them to the end and apply
them once if you have successfully enqueued. For an example of this
look e1000 driver e1000_kick_DMA() function.

b) The dev->hard_prep_xmit() method

The benefits of this method are described in an a separate document.

Use this method to only do pre-processing of the skb passed.
If in the current dev->hard_start_xmit() you are pre-processing
packets before holding any locks (eg formating them to be put in
any descriptor etc).
Look at e1000_prep_queue_frame() for an example.
You may use the skb->cb to store any state that you need to know
of later when batching.
PS: I have found when discussing with Michael Chan and Matt Carlson
that skb->cb[0] is used by the VLAN code to pass VLAN info to the driver.
I think this is a violation of the usage of the cb scratch pad. 
To work around this, you could use skb->cb[8] or do what the broadcom
tg3 bacthing driver does which is to glean the vlan info first then
re-use the skb->cb.

5) setting the dev->xmit_win 
-----------------------------

As mentioned earlier this variable provides hints on how much
data to send from the core to the driver. Some suggestions:
a)on doing a netif_stop, set it to 1
b)on netif_wake_queue set it to the max available space

The variable is important because it avoids the core sending
any more than what the driver can handle therefore avoiding 
any need to muck with packet scheduling mechanisms.

Appendix 1: History
-------------------
June 11: Initial revision
June 11: Fixed typo on e1000 netif_wake description ..
Aug  08: Added info on VLAN and the skb->cb[] danger ..

^ permalink raw reply

* Re: [PATCH RFC]: napi_struct V5
From: jamal @ 2007-08-08 12:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, shemminger, jgarzik, rusty
In-Reply-To: <20070807.175920.15265201.davem@davemloft.net>

On Tue, 2007-07-08 at 17:59 -0700, David Miller wrote:

> I am tempted to suggest we toss the document completely, for
> two reasons:

Its overdue - just hasnt been anybody who has raised their hands
to do it. Stephen did at one point. 
Theres a lot of nice insights in that doc that will be nice to inherit.
Theres also a nice state machine diagram in 
http://www.kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf
that will be very useful to capture in whatever new doc.

cheers,
jamal

^ permalink raw reply

* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
From: Evgeniy Polyakov @ 2007-08-08 12:14 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, hadi, general,
	mchan, tgraf, netdev, shemminger, davem, sri
In-Reply-To: <20070808093145.15396.91711.sendpatchset@localhost.localdomain>

On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> +			  struct sk_buff_head *blist, struct sk_buff **skbp)
> +{
> +	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> +	} else {
> +		int max = dev->tx_queue_len - skb_queue_len(blist);
> +		struct sk_buff *skb;
> +
> +		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> +			max -= dev_add_skb_to_blist(skb, dev);
> +
> +		*skbp = NULL;
> +		return 1;	/* we have atleast one skb in blist */
> +	}
> +}

Same here - is it possible to get a list in one go instead of pulling
one-by-one, since it forces quite a few additional unneded lock
get/releases. What about dev_dequeue_number_skb(dev, q, num), which will 
grab the lock and move a list of skbs from one queue to provided head.

> @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
>  	/* And release queue */
>  	spin_unlock(&dev->queue_lock);
>  
> -	ret = dev_hard_start_xmit(skb, dev);
> +	if (likely(skb))
> +		ret = dev_hard_start_xmit(skb, dev);
> +	else
> +		ret = dev->hard_start_xmit_batch(dev);

Perfectionism says that having array of two functions and calling one of
them via array_func_pointer[!!skb] will be much faster. Just a though.
It is actually much faster than if/else on x86 at least.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: 2.6.20->2.6.21 - networking dies after random time
From: Jarek Poplawski @ 2007-08-08 12:16 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan
In-Reply-To: <JMG6AY$2339D4FEA804188F4C6D4F66006A6BB6@xandmail.com>

On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote:
...
> > If you would like to read something more about testing (then of
> > course my suggestions could occur invalid - I'm a very bad tester
> > myself...) you can try this:
> > http://www.stardust.webpages.pl/files/handbook/
> 
> I'll have a look at the document.

BTW: this document describes some methods for a kind of 'professional'
testing (so you could save time if you do it very often). But, you
shouldn't think all this knowledge or tools are necessary. So, you can
skip many such things and still do very valuable testing with simpler
methods. And there are a lot of simple & good advices as well.

BTW #2: this all testing of older versions, which I've described, has
of course any reason only if after present patches you'll still think
the older kernel had worked better for you.

Jarek P. 

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Andi Kleen @ 2007-08-08 12:55 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Andi Kleen, Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <200708081350.36081.mb@bu3sch.de>

On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > Jeff Garzik <jeff@garzik.org> writes:
> > > > +			val, reg_index, addr, addr+4); */
> > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > 
> > > wrong -- endian conversion macros not needed with writel()
> > 
> > Are you sure? I don't think that's true. e.g. powerpc writel 
> > doesn't convert endian
> 
> Andi, you are wrong.
> writeX take CPU endian args.

That is what I wrote.

-Andi

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Michael Buesch @ 2007-08-08 13:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <20070808125511.GA14419@one.firstfloor.org>

On Wednesday 08 August 2007 14:55:11 Andi Kleen wrote:
> On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> > On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > > Jeff Garzik <jeff@garzik.org> writes:
> > > > > +			val, reg_index, addr, addr+4); */
> > > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > > 
> > > > wrong -- endian conversion macros not needed with writel()
> > > 
> > > Are you sure? I don't think that's true. e.g. powerpc writel 
> > > doesn't convert endian
> > 
> > Andi, you are wrong.
> > writeX take CPU endian args.
> 
> That is what I wrote.

Uhm, so....
Why did you complain to Jeff?
I'm a little bit confused now :)

Fact is that we do _not_ need cpu_to_le32 with writel.

-- 
Greetings Michael.

^ permalink raw reply

* [DOC] Net tx batching core evolution
From: jamal @ 2007-08-08 13:03 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Jeff Garzik, Gagan Arneja,
	Leonid Grossman, Sridhar Samudrala, Rick Jones, Robert Olsson,
	David Miller, shemminger, kaber
In-Reply-To: <1186574575.5171.17.camel@localhost>

[-- Attachment #1: Type: text/plain, Size: 260 bytes --]

The attached doc describes the evolution of batching work that leads
towards the dev->hard_prep_xmit() api being introduced.

As usual the updated working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

cheers,
jamal

[-- Attachment #2: evolution-of-batch.txt --]
[-- Type: text/plain, Size: 5518 bytes --]

The code for this work can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The purpose of this doc is not to describe the batching work although
benefits of that approach could be gleaned from this doc (example
section 1.x should give the implicit value proposition).
There are more details on the rest of the batching work described in
the driver howto posted on netdev as well as a couple of presentations 
i have given in the past (refer to the last few slides of my netconf 2006 
slides for example).

The purpose of this doc is to describe the evolution of the work
work which leads to two important apis to the driver writer.
The first one is introduction of a method called 
dev->hard_prep_xmit() and the second api is the introduction
of variable dev->xmit_win

For the sake of clarity i will be using non-LLTX because it is easier 
to explain.
Note: The e1000, although non-LLTX has shown considerable improvement
with this approach as well as verified by experiments. (For a lot of
other reasons e1000 needs to be converted to be non-LLTX).

1.0 Classical approach 
------------------------

Lets start with classical approach of what a random driver will do

loop:
  1--core spin for qlock
  1a---dequeu packet
  1b---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    3a----format packet:
    --------e.g vlan, mss, shinfo, csum, descriptor count, etc
    -----------if there is something wrong free skb and return OK
    3b----do per chip specific checks
    -----------if there is something wrong free skb and return OK
    3c----all is good, now stash a packet into DMA ring.

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue, else break
end_loop:

In between #1 and #1b, another CPU contends for qlock
In between #3 and #4, another CPU contends for txlock

1.1 Challenge to classical approach:
-------------------------------------

The cost of grabbing/setting up a lock is not cheap.
Observation also: spinning CPUs is expensive because the utilization
of the compute cycles goes down. 
We assume the cost of dequeueing from the qdisc is not as expensive
as the enqueueing to DMA-ring.

1.2 Addressing the challenge to classical approach:
---------------------------------------------------

So we start with a simple premise to resolve the challenge above.
We try to amortize the cost by:
a) grabbing "as many" packets as we can between #1 and #1b with very 
little processing so we dont hold that lock for long.
b) then send "as many" as we can in between #3 and #4.

Lets start with a simple approach (which is what the batch code did 
in its earlier versions):

loop:
  1--core spin for qlock
    loop1: // "as many packets"
       1a---dequeu and enqueue on dev->blist
    end loop1:
  1b---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    loop2 ("for as many packets" or "no more ring space"):
       3a----format packet
       --------e.g vlan, mss, shinfo, csum, descriptor count, etc
       -----------if there is something wrong free skb and return OK (**)
       3b----do per chip specific checks
       -----------if there is something wrong free skb and return OK
       3c----all is good, now stash a packet into DMA ring.
    end_loop2:
       3d -- if you enqueued packets tell tx DMA to chew on them ...

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue;
end_loop

loop2 has the added side-benefit of improving the instruction cache 
warmness.
i.e If we can do things in a loop we would keep the instruction cache
warm for more packets (hence improve the hit rate/CPU-utilization).
However, the length of this loop may affect the hit-rate (very clear
to see as you use machines with bigger caches and qdisc_restart showing
in profiles). 
Note also: We have also amortized the cost of bus IO in step 3d 
because we make only one call after exiting the loop. 

2.0 New challenge
-----------------

A new challenge is introduced.
We could hold the tx lock for "long period" depending on how unclean
the driver path is. i.e imagine holding this lock for 100 packets.
So we need to find a balance.

We observe that within loop2, #3a doesnt need to be held within the
xmit lock. All it does is muck with an skb and may end up infact 
dropping it.

2.1 Addressing new challenge
-----------------------------

To address the new challenge, i introduced the dev->hard_prep_xmit() api. 
Essentially, dev->hard_prep_xmit() moves the packet formatting into #1
So now we change the code to be something along the lines of:

loop:
  1--core spin for qlock
    loop1: // "as many packets"
       1a---dequeu and enqueue on dev->blist
       1b--if driver has hard_prep_xmit, then format packet.
    end loop1:
  1c---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    loop2: ("for as many packets" or "no more ring space"):
       3a----do per chip specific checks
       -----------if there is something wrong free skb and return OK
       3b----all is good, now stash a packet into DMA ring.
    end_loop2:
       3c -- if you enqueued packets tell tx DMA to chew on them ...

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue;
end_loop

3.0 TBA:
--------
talk here about further optimizations added starting with xmit_win..

Appendix 1: HISTORY
---------------------
Aug 08/2007 - initial revision

^ permalink raw reply

* Re: TCP's initial cwnd setting correct?...
From: Ilpo Järvinen @ 2007-08-08 13:05 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev
In-Reply-To: <20070807.220127.98862306.davem@davemloft.net>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 994 bytes --]

On Tue, 7 Aug 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Mon, 6 Aug 2007 15:37:15 +0300 (EEST)
> 
> > ...Another thing that makes me wonder, is the tp->mss_cache > 1460 check 
> > as based on rfc3390 (and also it's precursor rfc2414) with up to 2190 
> > bytes MSS TCP can use 3 as initial cwnd...
> 
> I did the research and my memory was at least partially right.
> 
> Below is an old bogus change of mine and the later revert with
> Alexey's explanation.
> 
> This seems to be dealing with receive window calculation issues,
> rather than snd_cwnd.  But they might be related and you should
> consider this very seriously.

...Thanks about the info, I'll study it carefully and see what's
relevant in here too. And anyway we're currently on the safer side
as the potential issues only make it more conservative than the RFC 
allows, so there's no hurry to get it in until the solution is
acceptable if there's indeed a need for a change.

-- 
 i.

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Andi Kleen @ 2007-08-08 13:08 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Andi Kleen, Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <200708081502.35837.mb@bu3sch.de>

On Wed, Aug 08, 2007 at 03:02:35PM +0200, Michael Buesch wrote:
> On Wednesday 08 August 2007 14:55:11 Andi Kleen wrote:
> > On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> > > On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > > > Jeff Garzik <jeff@garzik.org> writes:
> > > > > > +			val, reg_index, addr, addr+4); */
> > > > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > > > 
> > > > > wrong -- endian conversion macros not needed with writel()
> > > > 
> > > > Are you sure? I don't think that's true. e.g. powerpc writel 
> > > > doesn't convert endian
> > > 
> > > Andi, you are wrong.
> > > writeX take CPU endian args.
> > 
> > That is what I wrote.
> 
> Uhm, so....
> Why did you complain to Jeff?

Because Jeff wrote the opposite.

> I'm a little bit confused now :)
> 
> Fact is that we do _not_ need cpu_to_le32 with writel.

We do on a big endian platform if the register is LE. I assume that's the case 
on this hardware.

-Andi

^ permalink raw reply

* Re: [RFC] stuff from tcp-2.6 partially merged to upcoming net-2.6.24?
From: Ilpo Järvinen @ 2007-08-08 13:13 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev
In-Reply-To: <20070808.035505.110907605.davem@davemloft.net>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2501 bytes --]

On Wed, 8 Aug 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 7 Aug 2007 16:19:58 +0300 (EEST)
> 
> > Do you have any suggestion how I should proceed? Or do you perhaps object 
> > such partial merge completely? ...I could try to come up with a cleaned up 
> > patch series which has original and their bug fix parts combined to a 
> > single patch per change (would provide cleaner history and shouldn't be 
> > very hard to do either)...
> 
> Thank you for doing the rebase.  I agree with you that we should
> seperate out as much of the non-rb-tree stuff as possible and put it
> into net-2.6.24

...It would also help a lot in case we at some point of time in the future 
decide to merge rb-tree stuff but then have to back up to have it merged 
in a cleaner state which is easier to revert than with cleanups 
combined... Besides once I get some additional bits done (haven't yet 
dared to do L-bit cascade surgery which will fix timedout loop entry 
bugs and allow dropping of retrans hint counter), long enough mm sit would 
be nice route to rbtree stuff since there it's getting at least bit more
versitile testing that we alone can put to it...

> I'll will try hard to make time to look at your rebase and try to move
> things forward. 

...I case you want to validate them, git-patch-id helps to exclude 
identical changes between before and after allowing you to spend more
time on things that required non-trivial resolution...

> I've been all-consumed by the NAPI work and a driver I've been writing 
> in the background for the past few weeks.

...I can try to help... I suggest that as first step we take all changes 
that do not cause any conflicts (I've already tried cherry-picks), only 
thing I'm not sure whether I should combine change+fix parts or keep 
them as they were in tcp-2.6 (I suggest we combine them but you may 
disagree..?). I can either post them as series or give you them in 
--pretty=oneline format if you want to cherry-pick them yourself (in 
that case having a common tree would help a bit as commit ids would
match as well then)).

However, at least highest_sack inclusion (no space loss as one hint 
counter can be then dropped and allows accurate fackets out which I've 
partially coded already) will need also SACK-block validator. But I
basically have to redo as the validator patch was sacktag reorganization 
based previously (I've already found couple of improvements too to it's
accuracy :-))...

-- 
 i.

^ permalink raw reply

* Re: [PATCH] drivers/net/wireless/wl3501_cs.c: remove redundant memset
From: John W. Linville @ 2007-08-08 13:20 UTC (permalink / raw)
  To: Mariusz Kozlowski
  Cc: acme-f8uhVLnGfZaxAyOMLChx1axOck334EZe, Jeff Garzik,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <200708080758.22319.m.kozlowski-NWF1p15JEu3VItvQsEIGlw@public.gmane.org>

On Wed, Aug 08, 2007 at 07:58:21AM +0200, Mariusz Kozlowski wrote:
> > Please send wireless patches to linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > and CC me.
> 
> Ok. Did you pick the patch up?

Yes, I have it.

John
-- 
John W. Linville
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Michael Buesch @ 2007-08-08 13:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <20070808130828.GB14419@one.firstfloor.org>

On Wednesday 08 August 2007 15:08:28 Andi Kleen wrote:
> On Wed, Aug 08, 2007 at 03:02:35PM +0200, Michael Buesch wrote:
> > On Wednesday 08 August 2007 14:55:11 Andi Kleen wrote:
> > > On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> > > > On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > > > > Jeff Garzik <jeff@garzik.org> writes:
> > > > > > > +			val, reg_index, addr, addr+4); */
> > > > > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > > > > 
> > > > > > wrong -- endian conversion macros not needed with writel()

> > Fact is that we do _not_ need cpu_to_le32 with writel.
> 
> We do on a big endian platform if the register is LE. I assume that's the case 
> on this hardware.

That is not true.
writeX does automatically convert to bus-endian.
Which, in case of the PCI bus, is little endian.
So if your register is LE (which it is most likely), you don't
need any conversion. If your register is BE (which I very much doubt),
then you need swab32().
In _no_ case you need cpu_to_xx().

-- 
Greetings Michael.

^ permalink raw reply

* Re: [1/1] Block device throttling [Re: Distributed storage.]
From: Evgeniy Polyakov @ 2007-08-08 13:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Daniel Phillips, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <20070808101708.GA23815@2ka.mipt.ru>

On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> This throttling mechanism allows to limit maximum amount of queued bios 
> per physical device. By default it is turned off and old block layer 
> behaviour with unlimited number of bios is used. When turned on (queue
> limit is set to something different than -1U via blk_set_queue_limit()),
> generic_make_request() will sleep until there is room in the queue.
> number of bios is increased in generic_make_request() and reduced either
> in bio_endio(), when bio is completely processed (bi_size is zero), and
> recharged from original queue when new device is assigned to bio via
> blk_set_bdev(). All oerations are not atomic, since we do not care about
> precise number of bios, but a fact, that we are close or close enough to
> the limit.
> 
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

As addon I can cook up a patch to configure this via sysfs if needed.
Thoughts?

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Andi Kleen @ 2007-08-08 13:38 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Andi Kleen, Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <200708081528.34116.mb@bu3sch.de>

On Wed, Aug 08, 2007 at 03:28:33PM +0200, Michael Buesch wrote:
> On Wednesday 08 August 2007 15:08:28 Andi Kleen wrote:
> > On Wed, Aug 08, 2007 at 03:02:35PM +0200, Michael Buesch wrote:
> > > On Wednesday 08 August 2007 14:55:11 Andi Kleen wrote:
> > > > On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> > > > > On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > > > > > Jeff Garzik <jeff@garzik.org> writes:
> > > > > > > > +			val, reg_index, addr, addr+4); */
> > > > > > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > > > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > > > > > 
> > > > > > > wrong -- endian conversion macros not needed with writel()
> 
> > > Fact is that we do _not_ need cpu_to_le32 with writel.
> > 
> > We do on a big endian platform if the register is LE. I assume that's the case 
> > on this hardware.
> 
> That is not true.
> writeX does automatically convert to bus-endian.
> Which, in case of the PCI bus, is little endian.
> So if your register is LE (which it is most likely), you don't
> need any conversion. If your register is BE (which I very much doubt),
> then you need swab32().
> In _no_ case you need cpu_to_xx().

Hmm, I checked a couple of BE architectures and none seem to convert explicitely.
But ok it's possible that their PCI bridges convert. I'll take your
word for it although it sounds somewhat dubious.

However if that's true then there are quite some drivers wrong:

% grep -r write[bwl]\(cpu_to *   | wc -l
57

-Andi


^ permalink raw reply

* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Herbert Xu @ 2007-08-08 13:42 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, jagana, hadi, general, mchan, tgraf,
	netdev, shemminger, kaber, sri
In-Reply-To: <20070808.034900.85820906.davem@davemloft.net>

On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> 
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?

You could even lower the bar by disabling TSO and enabling
software GSO.

> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.

I agree.  I suspect the bulk of the effort is in getting
these skb's created and processed by the stack so that by
the time that they're exiting the qdisc there's not much
to be saved anymore.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 2/14] nes: device structures and defines
From: Michael Buesch @ 2007-08-08 13:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jeff Garzik, ggrundstrom, rdreier, ewg, netdev
In-Reply-To: <20070808133850.GE14419@one.firstfloor.org>

On Wednesday 08 August 2007 15:38:50 Andi Kleen wrote:
> On Wed, Aug 08, 2007 at 03:28:33PM +0200, Michael Buesch wrote:
> > On Wednesday 08 August 2007 15:08:28 Andi Kleen wrote:
> > > On Wed, Aug 08, 2007 at 03:02:35PM +0200, Michael Buesch wrote:
> > > > On Wednesday 08 August 2007 14:55:11 Andi Kleen wrote:
> > > > > On Wed, Aug 08, 2007 at 01:50:35PM +0200, Michael Buesch wrote:
> > > > > > On Wednesday 08 August 2007 14:38:10 Andi Kleen wrote:
> > > > > > > Jeff Garzik <jeff@garzik.org> writes:
> > > > > > > > > +			val, reg_index, addr, addr+4); */
> > > > > > > > > +	writel(cpu_to_le32(reg_index), addr);
> > > > > > > > > +	writel(cpu_to_le32(val),(u8 *)addr + 4);
> > > > > > > > 
> > > > > > > > wrong -- endian conversion macros not needed with writel()
> > 
> > > > Fact is that we do _not_ need cpu_to_le32 with writel.
> > > 
> > > We do on a big endian platform if the register is LE. I assume that's the case 
> > > on this hardware.
> > 
> > That is not true.
> > writeX does automatically convert to bus-endian.
> > Which, in case of the PCI bus, is little endian.
> > So if your register is LE (which it is most likely), you don't
> > need any conversion. If your register is BE (which I very much doubt),
> > then you need swab32().
> > In _no_ case you need cpu_to_xx().
> 
> Hmm, I checked a couple of BE architectures and none seem to convert explicitely.

That depends on the arch.
Look at this from ppc, for example:

184 static inline void writel(__u32 b, volatile void __iomem *addr)
185 {
186         out_le32(addr, b);
187 }

out_le32 writes a little endian value. So it converts it.
Also note that b is __u32. Sparse would complain, if someone used cpu_to_xx
on it.

> But ok it's possible that their PCI bridges convert. I'll take your
> word for it although it sounds somewhat dubious.
>
> However if that's true then there are quite some drivers wrong:
> 
> % grep -r write[bwl]\(cpu_to *   | wc -l
> 57

Yeah, seems so.
Here comes an example (with 16bit values)

Little endian register
	LittleEndian arch		BigEngian arch
value	0xBBAA				0xAABB
writew	0xBBAA				0xAABB
on bus	0xBBAA				0xBBAA
on dev	0xBBAA				0xBBAA

Big endian register
	LittleEndian arch		BigEngian arch
value	0xBBAA				0xAABB
swab16	0xAABB				0xBBAA
writew	0xAABB				0xBBAA
on bus	0xAABB				0xAABB
on dev	0xAABB				0xAABB

I hope I got it right. :)
Same result on every arch.

Most hardware has little endian registers. Some can switch
endianess based on some bit, too. So in case of a BE register that
bit has to be flipped, or if not possible swabX() has to be used.

-- 
Greetings Michael.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox