Netdev List
 help / color / mirror / Atom feed
* Re: Regression 2.6.36 - driver rtl8169 crashes kernel, triggered by user app
From: Francois Romieu @ 2010-11-20  0:35 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: linux-kernel, netdev
In-Reply-To: <201011181549.08545@zmi.at>

Michael Monnerie <michael.monnerie@is.it-management.at> :
> I did not get any answer, so here again:

I already answered one week ago, you were Cc:ed and I had my
answer delivered through vger. Please check your spambox.

Can you test again after reverting 801e147cde02f04b5c2f42764cd43a89fc7400a2 ?

-- 
Ueimor

^ permalink raw reply

* Fail Transfer of Large Files
From: Michael D. Berger @ 2010-11-19 23:45 UTC (permalink / raw)
  To: netdev

On my intranet, I sometimes transfer large files, about 4G,
to an CentOS old box that I use for a web server.  I transfer
with ftp or sftp.  Usually, before the file is complete, the
transfer "stalls".  At that point, ping from the destination box
to the router fails.  I then deactivate the net interface on the
destination box and then activate it.  Ping is then successful,
and the transfer is completed.  The transferred file is correct,
as verified with sha1sum.

All connections are via cat6 wire.

So what do you think?  Should I try changing the net card?
Any tests to run? Any other suggestions?

Thanks for your help.

Mike.


^ permalink raw reply

* Re: [PATCH v2] of/phylib: Use device tree properties to initialize Marvell PHYs.
From: David Daney @ 2010-11-19 23:38 UTC (permalink / raw)
  To: Andy Fleming
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Cyril Chemparathy,
	Arnaud Patard
In-Reply-To: <AANLkTikuikWgLbW67m1Ov0dt13A=DTSF150aTOs=PybT-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 11/19/2010 03:02 PM, Andy Fleming wrote:
> On Fri, Nov 19, 2010 at 4:13 PM, David Daney<ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>  wrote:
>> Some aspects of PHY initialization are board dependent, things like
>> indicator LED connections and some clocking modes cannot be determined
>> by probing.  The dev_flags element of struct phy_device can be used to
>> control these things if an appropriate value can be passed from the
>> Ethernet driver.  We run into problems however if the PHY connections
>> are specified by the device tree.  There is no way for the Ethernet
>> driver to know what flags it should pass.
>>
>> If we are using the device tree, the struct phy_device will be
>> populated with the device tree node corresponding to the PHY, and we
>> can extract extra configuration information from there.
>>
>> The next question is what should the format of that information be?
>> It is highly device specific, and the device tree representation
>> should not be tied to any arbitrary kernel defined constants.  A
>> straight forward representation is just to specify the exact bits that
>> should be set using the "marvell,reg-init" property:
>>
>>       phy5: ethernet-phy@5 {
>>         reg =<5>;
>>         compatible = "marvell,88e1149r";
>>         marvell,reg-init =
>>                 /* led[0]:1000, led[1]:100, led[2]:10, led[3]:tx */
>>                 <3 0x10 0 0x5777>, /* Reg 3,16<- 0x5777 */
>>                 /* mix %:0, led[0123]:drive low off hiZ */
>>                 <3 0x11 0 0x00aa>, /* Reg 3,17<- 0x00aa */
>>                 /* default blink periods. */
>>                 <3 0x12 0 0x4105>, /* Reg 3,18<- 0x4105 */
>>                 /* led[4]:rx, led[5]:dplx, led[45]:drive low off hiZ */
>>                 <3 0x13 0 0x0a60>; /* Reg 3,19<- 0x0a60 */
>
>
> My inclination is to shy away from such hard-coded values.  I agreed
> with Grant's comment about separating into separate cells, but I think
> specification of hard-coded register writes is too rigid.  I think a
> better approach would be to specify configuration attributes, like:
>
> marvell,blink-periods =<blah>;
> marvell,led-config =<[drive type] [indicates]>;

I considered doing that, but rejected the idea because it is so complex. 
  There are literally probably a hundred different register fields with 
widths varying between 1 and 16 bits.  The number of settings that the 
driver might want to deal with are astronomical.  For any given PHY 
model there could be 30 different settings that you might want to tweak. 
  In the example above I configure 5 LEDs with two parameters each, and 
about 4 or 5 more blink settings, that is something like 15 register fields.

Take the marvell.c file.  It currently supports nine different PHY 
models, each would have a bunch of model specific settings, the 
complexity of decoding all that is formidable (and would be a lot more 
code) also there would invariably be settings that were not implemented 
in the first version, so people would have to repeatedly extend it.

My patch adds 550 bytes (not including the string storage) of code on 
MIPS64, and can handle many cases without any changes.

>
> For one, I always advocate making the DTS human-readable.  For
> another, I think that there are a number of configuration sequences
> that require read-modify-write, or write-wait-write.  In other words,
> I think that there are enough cases where actual software will be
> needed, that an attempt to generically specify a register
> initialization sequence will be impossible, and leave us with the same
> problems to solve later on.

We can always add more properties as you suggest, if the existing code 
does not prove to be sufficiently flexible.

>  For third...ly... allowing
> device-tree-specified register initializations might encourage
> developers to put all of their register initializations in the device
> tree.  Especially when they realize that the LED initialization for
> *their* PHY has to come between two standard initialization steps in
> the driver.  Or before.  Or after.
>
> By specifying actual functionality, the driver can work around those
> problems, while being aware of the functional goal.  However, I'm
> aware that such a path is more difficult, and perhaps just as futile,
> as PHY vendors frequently don't want to document what their magic
> sequences mean.
>

I certainly understand your desire to make the dts readable, but I think 
it has to be balanced against the complexity and chance for bugs in the 
drivers as well as churn in the bindings specifications.

David Daney

^ permalink raw reply

* Re: 2.6.37-rc2-git4: Reported regressions 2.6.35 -> 2.6.36
From: Alex Deucher @ 2010-11-19 23:17 UTC (permalink / raw)
  To: Mark Lord
  Cc: Rafael J. Wysocki, Linux SCSI List, Linux ACPI,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, DRI, Florian Mickler, Andrew Morton,
	Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <4CE70543.2000206@teksavvy.com>

On Fri, Nov 19, 2010 at 6:16 PM, Mark Lord <kernel@teksavvy.com> wrote:
> On 10-11-19 05:58 PM, Alex Deucher wrote:
>>
>> On Fri, Nov 19, 2010 at 5:55 PM, Mark Lord<kernel@teksavvy.com>  wrote:
>>
>>> It now comes back at resume time.
>>
>> So that patch helped?
>
> I think so.  It didn't used to resume from suspend with 2.6.36, and now it
> does.
>
>>> But suffers long delays (also sometimes with 2.6.35) doing this:
>>>
>>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>>> aborting
>>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576
>>> (len
>>> 105, WS 12, PS 8) @ 0xE5C4
>>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>>> aborting
>>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing ECD2
>>> (len
>>> 86, WS 4, PS 0) @ 0xED05
>>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>>> aborting
>>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576
>>> (len
>>> 105, WS 12, PS 8) @ 0xE5C4
>>> PM: resume of devices complete after 15718.253 msecs
>>>
>>
>> It's be nice if you could bisect to track down when those started.
>
> It'd be even nicer if they hadn't started.  :)
>
> What kernel release first had that atom/ops table in it?  I'll try that.
>

2.6.31 or 32 was when radeon kms got merged IIRC.

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: 2.6.37-rc2-git4: Reported regressions 2.6.35 -> 2.6.36
From: Mark Lord @ 2010-11-19 23:16 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rafael J. Wysocki, Linux SCSI List, Linux ACPI,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, DRI, Florian Mickler, Andrew Morton,
	Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <AANLkTim9aXi1_EJyh2f-+X4HnNWJ3uk2z=P6m1hQso5a@mail.gmail.com>

On 10-11-19 05:58 PM, Alex Deucher wrote:
> On Fri, Nov 19, 2010 at 5:55 PM, Mark Lord<kernel@teksavvy.com>  wrote:
>
>> It now comes back at resume time.
>
> So that patch helped?

I think so.  It didn't used to resume from suspend with 2.6.36, and now it does.

>> But suffers long delays (also sometimes with 2.6.35) doing this:
>>
>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>> aborting
>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len
>> 105, WS 12, PS 8) @ 0xE5C4
>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>> aborting
>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing ECD2 (len
>> 86, WS 4, PS 0) @ 0xED05
>> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
>> aborting
>> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len
>> 105, WS 12, PS 8) @ 0xE5C4
>> PM: resume of devices complete after 15718.253 msecs
>>
>
> It's be nice if you could bisect to track down when those started.

It'd be even nicer if they hadn't started.  :)

What kernel release first had that atom/ops table in it?  I'll try that.

^ permalink raw reply

* Re: [PATCH v2] of/phylib: Use device tree properties to initialize Marvell PHYs.
From: Andy Fleming @ 2010-11-19 23:02 UTC (permalink / raw)
  To: David Daney
  Cc: devicetree-discuss, grant.likely, linux-kernel, netdev,
	Cyril Chemparathy, Arnaud Patard, Benjamin Herrenschmidt
In-Reply-To: <1290204798-28569-1-git-send-email-ddaney@caviumnetworks.com>

On Fri, Nov 19, 2010 at 4:13 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> Some aspects of PHY initialization are board dependent, things like
> indicator LED connections and some clocking modes cannot be determined
> by probing.  The dev_flags element of struct phy_device can be used to
> control these things if an appropriate value can be passed from the
> Ethernet driver.  We run into problems however if the PHY connections
> are specified by the device tree.  There is no way for the Ethernet
> driver to know what flags it should pass.
>
> If we are using the device tree, the struct phy_device will be
> populated with the device tree node corresponding to the PHY, and we
> can extract extra configuration information from there.
>
> The next question is what should the format of that information be?
> It is highly device specific, and the device tree representation
> should not be tied to any arbitrary kernel defined constants.  A
> straight forward representation is just to specify the exact bits that
> should be set using the "marvell,reg-init" property:
>
>      phy5: ethernet-phy@5 {
>        reg = <5>;
>        compatible = "marvell,88e1149r";
>        marvell,reg-init =
>                /* led[0]:1000, led[1]:100, led[2]:10, led[3]:tx */
>                <3 0x10 0 0x5777>, /* Reg 3,16 <- 0x5777 */
>                /* mix %:0, led[0123]:drive low off hiZ */
>                <3 0x11 0 0x00aa>, /* Reg 3,17 <- 0x00aa */
>                /* default blink periods. */
>                <3 0x12 0 0x4105>, /* Reg 3,18 <- 0x4105 */
>                /* led[4]:rx, led[5]:dplx, led[45]:drive low off hiZ */
>                <3 0x13 0 0x0a60>; /* Reg 3,19 <- 0x0a60 */


My inclination is to shy away from such hard-coded values.  I agreed
with Grant's comment about separating into separate cells, but I think
specification of hard-coded register writes is too rigid.  I think a
better approach would be to specify configuration attributes, like:

marvell,blink-periods = <blah>;
marvell,led-config = <[drive type] [indicates]>;

For one, I always advocate making the DTS human-readable.  For
another, I think that there are a number of configuration sequences
that require read-modify-write, or write-wait-write.  In other words,
I think that there are enough cases where actual software will be
needed, that an attempt to generically specify a register
initialization sequence will be impossible, and leave us with the same
problems to solve later on.  For third...ly... allowing
device-tree-specified register initializations might encourage
developers to put all of their register initializations in the device
tree.  Especially when they realize that the LED initialization for
*their* PHY has to come between two standard initialization steps in
the driver.  Or before.  Or after.

By specifying actual functionality, the driver can work around those
problems, while being aware of the functional goal.  However, I'm
aware that such a path is more difficult, and perhaps just as futile,
as PHY vendors frequently don't want to document what their magic
sequences mean.

Andy

^ permalink raw reply

* Re: 2.6.37-rc2-git4: Reported regressions 2.6.35 -> 2.6.36
From: Alex Deucher @ 2010-11-19 22:58 UTC (permalink / raw)
  To: Mark Lord
  Cc: Rafael J. Wysocki, Linux SCSI List, Linux ACPI,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, DRI, Florian Mickler, Andrew Morton,
	Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <4CE7006E.4040102@teksavvy.com>

On Fri, Nov 19, 2010 at 5:55 PM, Mark Lord <kernel@teksavvy.com> wrote:
> On 10-11-19 11:39 AM, Alex Deucher wrote:
>>
>> On Thu, Nov 18, 2010 at 7:47 PM, Mark Lord<kernel@teksavvy.com>  wrote:
>>
>>> My non-Intel graphics notebook (has ATI X1400 graphics) also has a resume
>>> regression with 2.6.36.  But it does work fine with 2.6.35 (and earlier,
>>> back many years).  As a result, I'm stuck with 2.6.35 for the time being,
>>> and lack the time for a concerted debug effort on 2.6.36+ right now.
>>>
>>
>> Can you bisect?  Does this patch help?
>>
>> diff --git a/drivers/gpu/drm/radeon/atom.c b/drivers/gpu/drm/radeon/atom.c
>> index 8e421f6..05efb5b 100644
>> --- a/drivers/gpu/drm/radeon/atom.c
>> +++ b/drivers/gpu/drm/radeon/atom.c
>> @@ -112,6 +112,7 @@ static uint32_t atom_iio_execute(struct
>> atom_context *ctx, int base,
>>                         base += 3;
>>                         break;
>>                 case ATOM_IIO_WRITE:
>> +                       (void)ctx->card->ioreg_read(ctx->card, CU16(base +
>> 1));
>>                         ctx->card->ioreg_write(ctx->card, CU16(base + 1),
>> temp);
>>                         base += 3;
>>                         break;
>
> It now comes back at resume time.

So that patch helped?

>
> But suffers long delays (also sometimes with 2.6.35) doing this:
>
> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
> aborting
> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len
> 105, WS 12, PS 8) @ 0xE5C4
> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
> aborting
> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing ECD2 (len
> 86, WS 4, PS 0) @ 0xED05
> [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs
> aborting
> [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len
> 105, WS 12, PS 8) @ 0xE5C4
> PM: resume of devices complete after 15718.253 msecs
>

It's be nice if you could bisect to track down when those started.

> So I did this (local hack only, obviously NOT for mainline) to work around
> that issue:
>
> --- linux-2.6.36/drivers/gpu/drm/radeon/atom.c  2010-10-20
> 16:30:22.000000000 -0400
> +++ linux/drivers/gpu/drm/radeon/atom.c 2010-11-19 17:14:21.141807003 -0500
> @@ -1150,6 +1151,7 @@
>
>        if (!base)
>                return -EINVAL;
> +       if (base == 0xe576 || base == 0xecd2) return 0;  /* prevent freezes
> on Dell i9400 w/X1400 */
>
>        len = CU16(base + ATOM_CT_SIZE_PTR);
>        ws = CU8(base + ATOM_CT_WS_PTR);
>
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: 2.6.37-rc2-git4: Reported regressions 2.6.35 -> 2.6.36
From: Mark Lord @ 2010-11-19 22:55 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rafael J. Wysocki, Linux SCSI List, Linux ACPI,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, DRI, Florian Mickler, Andrew Morton,
	Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <AANLkTi=8G7pNA7QTxaXDWKrWrCERPfmKU+nLJWCA=JUH-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10-11-19 11:39 AM, Alex Deucher wrote:
> On Thu, Nov 18, 2010 at 7:47 PM, Mark Lord<kernel-R6A+fiHC8nRWk0Htik3J/w@public.gmane.org>  wrote:
>
>> My non-Intel graphics notebook (has ATI X1400 graphics) also has a resume
>> regression with 2.6.36.  But it does work fine with 2.6.35 (and earlier,
>> back many years).  As a result, I'm stuck with 2.6.35 for the time being,
>> and lack the time for a concerted debug effort on 2.6.36+ right now.
>>
>
> Can you bisect?  Does this patch help?
>
> diff --git a/drivers/gpu/drm/radeon/atom.c b/drivers/gpu/drm/radeon/atom.c
> index 8e421f6..05efb5b 100644
> --- a/drivers/gpu/drm/radeon/atom.c
> +++ b/drivers/gpu/drm/radeon/atom.c
> @@ -112,6 +112,7 @@ static uint32_t atom_iio_execute(struct
> atom_context *ctx, int base,
>                          base += 3;
>                          break;
>                  case ATOM_IIO_WRITE:
> +                       (void)ctx->card->ioreg_read(ctx->card, CU16(base + 1));
>                          ctx->card->ioreg_write(ctx->card, CU16(base + 1), temp);
>                          base += 3;
>                          break;

It now comes back at resume time.

But suffers long delays (also sometimes with 2.6.35) doing this:

[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len 105, 
WS 12, PS 8) @ 0xE5C4
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing ECD2 (len 86, 
WS 4, PS 0) @ 0xED05
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing E576 (len 105, 
WS 12, PS 8) @ 0xE5C4
PM: resume of devices complete after 15718.253 msecs

So I did this (local hack only, obviously NOT for mainline) to work around that 
issue:

--- linux-2.6.36/drivers/gpu/drm/radeon/atom.c  2010-10-20 16:30:22.000000000 -0400
+++ linux/drivers/gpu/drm/radeon/atom.c 2010-11-19 17:14:21.141807003 -0500
@@ -1150,6 +1151,7 @@

         if (!base)
                 return -EINVAL;
+       if (base == 0xe576 || base == 0xecd2) return 0;  /* prevent freezes on 
Dell i9400 w/X1400 */

         len = CU16(base + ATOM_CT_SIZE_PTR);
         ws = CU8(base + ATOM_CT_WS_PTR);

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: Tom Herbert @ 2010-11-19 22:49 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev
In-Reply-To: <20101119.140818.242132853.davem@davemloft.net>

> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer.  Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.
>
Right, with the caveats that even ACK'ed data might still go out on
the with that was discussed in the vmsplice() related patches.  I
don't think this should make the problem any worse.

> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all.  With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

True, and I would never say that this is suitable replacement for all
TCP transmit.  However, there are specialized applications where this
could be applied.  Note that the buffer is not just a kernel buffer,
it is also an application visible buffer with more purpose than just
buffering data for transmit.  For instance, an application using RPC
could assemble it's message directly into this buffer (which is cool).
 The obvious alternative would be to malloc a buffer and then just use
vmsplice(), either way the application will be allocating a send
buffer for its work.

Tom

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 22:47 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20101119.140818.242132853.davem@davemloft.net>

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:58:21 -0800
> 
> 
>>David Miller wrote:
>>
>>>From: Rick Jones <rick.jones2@hp.com>
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote?  For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring
>>>entries
>>>are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA have completed if it were UDP)?
> 
> 
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer.  Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.

I'm not worried about the ACK/kfree_skb() moving the pointer.  I'm simply 
worried about what the application should infer from the pointer's movement. 
That is, if the design is documented  "Movement of the consumer pointer implies 
that the corresponding data has been ACKed by the remote TCP" that is locking 
the design into a semantic I don't know that it will always want to maintain, 
because there may end-up being some cases where the stack might indeed want to 
copy and so not maintain that "pointer update means the remote TCP has the data" 
semantic.

> But you do bring up an interesting point about TX buffer space sizing.
> 
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
> 
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all.  With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

In some ways, the stone ages were nicer :)

What if... :)  the stack had a way to communicate to the application that it 
wanted to change the effective socket buffer size?  If that is indeed 
sufficiently infrequent, perhaps a "signal the new size and the app does a fresh 
mmap()" mechanism would suffice. The app would, I presume need to first wait for 
the existing ring to drain, which could cause some complications I suppose.  Is 
there a way to flip the sense and have the kernel allocate the ring(s) and 
communicate that to the application?

But doesn't the whole idea of having an explicitly mmap()ed area of memory fly 
in the face of autotuning to begin with?  (Mind you, I've not always been a fan 
of autotuning as some of my previous "Why is it growing the window so large?!?" 
will attest :)  It is suggesting that the application has some "communications 
memory" (that it won't be itself copying to/from) and presumably knows or thinks 
it knows how much of that it needs.  For all we know, Tom is thinking that this 
mmap()ed region of memory will be rather larger than the maximum autotuned 
socket buffer sizes in the first place.  Going back to his initial email I don't 
see anything that explicitly describes the relationship between the size of this 
mmap()'ed region and the socket buffer sizes - I was just ass-u-me-ing it would 
set them.  Sure, it would have to be an effective upper bound for copy-less 
transmit and receive, but there is nothing that says the windows TCP is using 
have to be that large.

rick

^ permalink raw reply

* Re: [PATCH 2.6.37-rc1] net-next: Add multiqueue support to vmxnet3 driver
From: David Miller @ 2010-11-19 22:24 UTC (permalink / raw)
  To: sbhatewara; +Cc: bhutchings, shemminger, netdev, pv-drivers, linux-kernel
In-Reply-To: <alpine.LRH.2.00.1011191252070.22641@sbhatewara-dev1.eng.vmware.com>

From: Shreyas Bhatewara <sbhatewara@vmware.com>
Date: Fri, 19 Nov 2010 12:55:24 -0800 (PST)

> From: Shreyas Bhatewara <sbhatewara@vmware.com>
> 
> Add multiqueue support to vmxnet3 driver
> 
> This change adds multiqueue and thus receive side scaling support
> to vmxnet3 device driver. Number of rx queues is limited to 1 in cases
> where MSI is not configured or one MSIx vector is not available per rx
> queue
> 
> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
> Reviewed-by: Bhavesh Davda <bhavesh@vmware.com>

Applied, but I still had to fix something up:

> @@ -2726,3 +3277,5 @@ MODULE_AUTHOR("VMware, Inc.");
>  MODULE_DESCRIPTION(VMXNET3_DRIVER_DESC);
>  MODULE_LICENSE("GPL v2");
>  MODULE_VERSION(VMXNET3_DRIVER_VERSION_STRING);
> +
> +

Extraneous blank lines added to the end of a file are considered
errors by GIT, so I removed this hunk.

^ permalink raw reply

* [PATCH v2] of/phylib: Use device tree properties to initialize Marvell PHYs.
From: David Daney @ 2010-11-19 22:13 UTC (permalink / raw)
  To: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	grant.likely-s3s/WqlpOiPyB63q8FvJNQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David Daney, Arnaud Patard, Cyril Chemparathy

Some aspects of PHY initialization are board dependent, things like
indicator LED connections and some clocking modes cannot be determined
by probing.  The dev_flags element of struct phy_device can be used to
control these things if an appropriate value can be passed from the
Ethernet driver.  We run into problems however if the PHY connections
are specified by the device tree.  There is no way for the Ethernet
driver to know what flags it should pass.

If we are using the device tree, the struct phy_device will be
populated with the device tree node corresponding to the PHY, and we
can extract extra configuration information from there.

The next question is what should the format of that information be?
It is highly device specific, and the device tree representation
should not be tied to any arbitrary kernel defined constants.  A
straight forward representation is just to specify the exact bits that
should be set using the "marvell,reg-init" property:

      phy5: ethernet-phy@5 {
        reg = <5>;
        compatible = "marvell,88e1149r";
        marvell,reg-init =
                /* led[0]:1000, led[1]:100, led[2]:10, led[3]:tx */
                <3 0x10 0 0x5777>, /* Reg 3,16 <- 0x5777 */
                /* mix %:0, led[0123]:drive low off hiZ */
                <3 0x11 0 0x00aa>, /* Reg 3,17 <- 0x00aa */
                /* default blink periods. */
                <3 0x12 0 0x4105>, /* Reg 3,18 <- 0x4105 */
                /* led[4]:rx, led[5]:dplx, led[45]:drive low off hiZ */
                <3 0x13 0 0x0a60>; /* Reg 3,19 <- 0x0a60 */
      };

      phy6: ethernet-phy@6 {
        reg = <6>;
        compatible = "marvell,88e1118";
        marvell,reg-init =
                /* Fix rx and tx clock transition timing */
                <2 0x15 0xffcf 0>, /* Reg 2,21 Clear bits 4, 5 */
                /* Adjust LED drive. */
                <3 0x11 0 0x442a>, /* Reg 3,17 <- 0442a */
                /* irq, blink-activity, blink-link */
                <3 0x10 0 0x0242>; /* Reg 3,16 <- 0x0242 */
      };

The Marvell PHYs have a page select register at register 22 (0x16), we
can specify any register by its page and register number.  These are
the first and second word.  The third word contains a mask to be ANDed
with the existing register value, and the fourth word is ORed with the
result to yield the new register value.  The new marvell_of_reg_init
function leaves the page select register unchanged, so a call to it
can be dropped into the .config_init functions without unduly
affecting the state of the PHY.

If CONFIG_OF_MDIO is not set, there is no of_node, or no
"marvell,reg-init" property, the PHY initialization is unchanged.

Signed-off-by: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
Cc: Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org>
Cc: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
Cc: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
Cc: Arnaud Patard <arnaud.patard-dQbF7i+pzddAfugRpC6u6w@public.gmane.org>
Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
---

Note: this patch now depends on the recently sent 88E1149R support
patch, which was seperated into its own patch set.

I think I have incororated all feedback from Grant Likely and Cyril
Chemparathy, and Milton Miller.

David Daney

 drivers/net/phy/marvell.c |   97 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index def19d7..e8b9c53 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -30,6 +30,7 @@
 #include <linux/ethtool.h>
 #include <linux/phy.h>
 #include <linux/marvell_phy.h>
+#include <linux/of.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -187,6 +188,87 @@ static int marvell_config_aneg(struct phy_device *phydev)
 	return 0;
 }
 
+#ifdef CONFIG_OF_MDIO
+/*
+ * Set and/or override some configuration registers based on the
+ * marvell,reg-init property stored in the of_node for the phydev.
+ *
+ * marvell,reg-init = <reg-page reg mask value>,...;
+ *
+ * There may be one or more sets of <reg-page reg mask value>:
+ *
+ * reg-page: which register bank to use.
+ * reg: the register.
+ * mask: if non-zero, ANDed with existing register value.
+ * value: ORed with the masked value and written to the regiser.
+ *
+ */
+static int marvell_of_reg_init(struct phy_device *phydev)
+{
+	const __be32 *paddr;
+	int len, i, saved_page, current_page, page_changed, ret;
+
+	if (!phydev->dev.of_node)
+		return 0;
+
+	paddr = of_get_property(phydev->dev.of_node, "marvell,reg-init", &len);
+	if (!paddr || len < (4 * sizeof(*paddr)))
+		return 0;
+
+	saved_page = phy_read(phydev, MII_MARVELL_PHY_PAGE);
+	if (saved_page < 0)
+		return saved_page;
+	page_changed = 0;
+	current_page = saved_page;
+
+	ret = 0;
+	len /= sizeof(*paddr);
+	for (i = 0; i < len - 3; i += 4) {
+		u16 reg_page = be32_to_cpup(paddr + i);
+		u16 reg = be32_to_cpup(paddr + i + 1);
+		u16 mask = be32_to_cpup(paddr + i + 2);
+		u16 val_bits = be32_to_cpup(paddr + i + 3);
+		int val;
+
+		if (reg_page != current_page) {
+			current_page = reg_page;
+			page_changed = 1;
+			ret = phy_write(phydev, MII_MARVELL_PHY_PAGE, reg_page);
+			if (ret < 0)
+				goto err;
+		}
+
+		val = 0;
+		if (mask) {
+			val = phy_read(phydev, reg);
+			if (val < 0) {
+				ret = val;
+				goto err;
+			}
+			val &= mask;
+		}
+		val |= val_bits;
+
+		ret = phy_write(phydev, reg, val);
+		if (ret < 0)
+			goto err;
+
+	}
+err:
+	if (page_changed) {
+		i = phy_write(phydev, MII_MARVELL_PHY_PAGE, saved_page);
+		if (ret == 0)
+			ret = i;
+	}
+	return ret;
+}
+#else
+static int marvell_of_reg_init(struct phy_device *phydev)
+{
+	return 0;
+}
+#endif /* CONFIG_OF_MDIO */
+
 static int m88e1121_config_aneg(struct phy_device *phydev)
 {
 	int err, oldpage, mscr;
@@ -369,6 +451,9 @@ static int m88e1111_config_init(struct phy_device *phydev)
 			return err;
 	}
 
+	err = marvell_of_reg_init(phydev);
+	if (err < 0)
+		return err;
 
 	err = phy_write(phydev, MII_BMCR, BMCR_RESET);
 	if (err < 0)
@@ -421,6 +506,10 @@ static int m88e1118_config_init(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
+	err = marvell_of_reg_init(phydev);
+	if (err < 0)
+		return err;
+
 	/* Reset address */
 	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
 	if (err < 0)
@@ -447,6 +536,10 @@ static int m88e1149_config_init(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
+	err = marvell_of_reg_init(phydev);
+	if (err < 0)
+		return err;
+
 	/* Reset address */
 	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
 	if (err < 0)
@@ -518,6 +611,10 @@ static int m88e1145_config_init(struct phy_device *phydev)
 		}
 	}
 
+	err = marvell_of_reg_init(phydev);
+	if (err < 0)
+		return err;
+
 	return 0;
 }
 
-- 
1.7.2.3

^ permalink raw reply related

* Re: Generalizing mmap'ed sockets
From: Andrew Grover @ 2010-11-19 22:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List
In-Reply-To: <AANLkTi=WF2hAGAufv_Anc=b=Fm2WOpOMOv1UrDRvaTHp@mail.gmail.com>

On Fri, Nov 19, 2010 at 12:04 PM, Tom Herbert <therbert@google.com> wrote:
> This is a project I'm contemplating.  If you have any comments or can
> point me to prior work in this area that would be appreciated.

> TX:
>  - Zero copy transmit (which is already supported by vmsplice(), but
> this might be simpler)
>  - One system call needed on transmit which can cover multiple
> datagrams or what would have been multiple writes (the call is just to
> kick kernel to start sending)

I'd look at our existing recvmmsg syscall -- there was talk of doing a
sendmmsg, which sounds close to what you want.

> RX:
>  - Zero system calls needed to do receive (determining data ready is
> accomplished by polling)
>  - Immediate data placement in kernel available all the time,
> including OOO placement
>  - Potential for true zero copy on receive with device support (like
> per flow queues, UDP queues)

Mentioning zero-copy per-flow queues in userspace suggests Infiniband
is prior work in this area.

Regards -- Andy

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: David Miller @ 2010-11-19 22:08 UTC (permalink / raw)
  To: rick.jones2; +Cc: therbert, netdev
In-Reply-To: <4CE6F2FD.8080301@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:58:21 -0800

> David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com>
>> Date: Fri, 19 Nov 2010 13:32:57 -0800
>> 
>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>to "know" that certain data had been ACKed by the remote?  For TCP
>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>the data out of the ring and assert "completion."
>> Yes, that's implicit in his design, the kernel manages the consumer
>> pointer in the ring and this is how userspace can see when ring
>> entries
>> are reusable.
> 
> But does one really want to lock-in that the update to the consumer
> pointer means the data has been ACKed by the remote (or I suppose that
> DMA have completed if it were UDP)?

I think the ACK (or for UDP, the kfree_skb() after TX completes) should
move the consumer pointer.  Otherwise you have to copy, and the ACKs
do not clock the sender process properly.

But you do bring up an interesting point about TX buffer space sizing.

This whole scheme currently seems to completely ignore buffer size
auto-tuning done by TCP, and that won't fly I think. :-)

The whole point is to make it so that applications do not need to know
about that aspect of buffering at all.  With the current mmap design
we're back to the stone ages where the app essentially has to pick an
explicit send buffer size.




^ permalink raw reply

* [PATCH 2/2] phylib: Add support for Marvell 88E1149R devices.
From: David Daney @ 2010-11-19 21:58 UTC (permalink / raw)
  To: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	grant.likely-s3s/WqlpOiPyB63q8FvJNQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David Daney, Arnaud Patard, Cyril Chemparathy
In-Reply-To: <1290203933-28251-1-git-send-email-ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>

The 88E1149R is 10/100/1000 quad-gigabit Ethernet PHY.  The
.config_aneg function can be shared with 88E1118, but it needs its own
.config_init.

Signed-off-by: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
Cc: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
Cc: Arnaud Patard <arnaud.patard-dQbF7i+pzddAfugRpC6u6w@public.gmane.org>
Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
Cc: Wolfram Sang <w.sang-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
---
 drivers/net/phy/marvell.c   |   40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/marvell_phy.h |    1 +
 2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 3600b8b..def19d7 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -433,6 +433,32 @@ static int m88e1118_config_init(struct phy_device *phydev)
 	return 0;
 }
 
+static int m88e1149_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Change address */
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0002);
+	if (err < 0)
+		return err;
+
+	/* Enable 1000 Mbit */
+	err = phy_write(phydev, 0x15, 0x1048);
+	if (err < 0)
+		return err;
+
+	/* Reset address */
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
+	if (err < 0)
+		return err;
+
+	err = phy_write(phydev, MII_BMCR, BMCR_RESET);
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
 static int m88e1145_config_init(struct phy_device *phydev)
 {
 	int err;
@@ -686,6 +712,19 @@ static struct phy_driver marvell_drivers[] = {
 		.driver = { .owner = THIS_MODULE },
 	},
 	{
+		.phy_id = MARVELL_PHY_ID_88E1149R,
+		.phy_id_mask = MARVELL_PHY_ID_MASK,
+		.name = "Marvell 88E1149R",
+		.features = PHY_GBIT_FEATURES,
+		.flags = PHY_HAS_INTERRUPT,
+		.config_init = &m88e1149_config_init,
+		.config_aneg = &m88e1118_config_aneg,
+		.read_status = &genphy_read_status,
+		.ack_interrupt = &marvell_ack_interrupt,
+		.config_intr = &marvell_config_intr,
+		.driver = { .owner = THIS_MODULE },
+	},
+	{
 		.phy_id = MARVELL_PHY_ID_88E1240,
 		.phy_id_mask = MARVELL_PHY_ID_MASK,
 		.name = "Marvell 88E1240",
@@ -736,6 +775,7 @@ static struct mdio_device_id __maybe_unused marvell_tbl[] = {
 	{ 0x01410e10, 0xfffffff0 },
 	{ 0x01410cb0, 0xfffffff0 },
 	{ 0x01410cd0, 0xfffffff0 },
+	{ 0x01410e50, 0xfffffff0 },
 	{ 0x01410e30, 0xfffffff0 },
 	{ 0x01410e90, 0xfffffff0 },
 	{ }
diff --git a/include/linux/marvell_phy.h b/include/linux/marvell_phy.h
index 1ff81b5..dd3c34e 100644
--- a/include/linux/marvell_phy.h
+++ b/include/linux/marvell_phy.h
@@ -11,6 +11,7 @@
 #define MARVELL_PHY_ID_88E1118		0x01410e10
 #define MARVELL_PHY_ID_88E1121R		0x01410cb0
 #define MARVELL_PHY_ID_88E1145		0x01410cd0
+#define MARVELL_PHY_ID_88E1149R		0x01410e50
 #define MARVELL_PHY_ID_88E1240		0x01410e30
 #define MARVELL_PHY_ID_88E1318S		0x01410e90
 
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH 1/2] phylib: Use common page register definition for Marvell PHYs.
From: David Daney @ 2010-11-19 21:58 UTC (permalink / raw)
  To: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	grant.likely-s3s/WqlpOiPyB63q8FvJNQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David Daney, Arnaud Patard, Cyril Chemparathy
In-Reply-To: <1290203933-28251-1-git-send-email-ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>

The definition of the Marvell PHY page register is not specific to
88E1121, so rename the macro to MII_MARVELL_PHY_PAGE, and use it
throughout.

Suggested-by: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
Signed-off-by: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
Cc: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
Cc: Arnaud Patard <arnaud.patard-dQbF7i+pzddAfugRpC6u6w@public.gmane.org>
Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
---
 drivers/net/phy/marvell.c |   27 ++++++++++++++-------------
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index f0bd1a1..3600b8b 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -35,6 +35,8 @@
 #include <asm/irq.h>
 #include <asm/uaccess.h>
 
+#define MII_MARVELL_PHY_PAGE		22
+
 #define MII_M1011_IEVENT		0x13
 #define MII_M1011_IEVENT_CLEAR		0x0000
 
@@ -80,7 +82,6 @@
 #define MII_88E1121_PHY_LED_CTRL	16
 #define MII_88E1121_PHY_LED_PAGE	3
 #define MII_88E1121_PHY_LED_DEF		0x0030
-#define MII_88E1121_PHY_PAGE		22
 
 #define MII_M1011_PHY_STATUS		0x11
 #define MII_M1011_PHY_STATUS_1000	0x8000
@@ -190,9 +191,9 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 {
 	int err, oldpage, mscr;
 
-	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
+	oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
 
-	err = phy_write(phydev, MII_88E1121_PHY_PAGE,
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE,
 			MII_88E1121_PHY_MSCR_PAGE);
 	if (err < 0)
 		return err;
@@ -218,7 +219,7 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 			return err;
 	}
 
-	phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
+	phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
 
 	err = phy_write(phydev, MII_BMCR, BMCR_RESET);
 	if (err < 0)
@@ -229,11 +230,11 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
-	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
+	oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
 
-	phy_write(phydev, MII_88E1121_PHY_PAGE, MII_88E1121_PHY_LED_PAGE);
+	phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_88E1121_PHY_LED_PAGE);
 	phy_write(phydev, MII_88E1121_PHY_LED_CTRL, MII_88E1121_PHY_LED_DEF);
-	phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
+	phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
 
 	err = genphy_config_aneg(phydev);
 
@@ -244,9 +245,9 @@ static int m88e1318_config_aneg(struct phy_device *phydev)
 {
 	int err, oldpage, mscr;
 
-	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
+	oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
 
-	err = phy_write(phydev, MII_88E1121_PHY_PAGE,
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE,
 			MII_88E1121_PHY_MSCR_PAGE);
 	if (err < 0)
 		return err;
@@ -258,7 +259,7 @@ static int m88e1318_config_aneg(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
-	err = phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
 	if (err < 0)
 		return err;
 
@@ -398,7 +399,7 @@ static int m88e1118_config_init(struct phy_device *phydev)
 	int err;
 
 	/* Change address */
-	err = phy_write(phydev, 0x16, 0x0002);
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0002);
 	if (err < 0)
 		return err;
 
@@ -408,7 +409,7 @@ static int m88e1118_config_init(struct phy_device *phydev)
 		return err;
 
 	/* Change address */
-	err = phy_write(phydev, 0x16, 0x0003);
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0003);
 	if (err < 0)
 		return err;
 
@@ -421,7 +422,7 @@ static int m88e1118_config_init(struct phy_device *phydev)
 		return err;
 
 	/* Reset address */
-	err = phy_write(phydev, 0x16, 0x0);
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
 	if (err < 0)
 		return err;
 
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH 0/2] phylib: Cleanup marvell.c and add 88E1149R support.
From: David Daney @ 2010-11-19 21:58 UTC (permalink / raw)
  To: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	grant.likely-s3s/WqlpOiPyB63q8FvJNQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David Daney, Arnaud Patard, Cyril Chemparathy

This is the second iteration of this patch.  I have split out the
device tree support from the first version to a different patch set.
The 88E1149R support is useful 'stand alone', so if it is acceptable,
it can be merged first.

The first patch is a small cleanup suggested by Cyril Chemparathy, the
second one adds basic 88E1149R support.

David Daney (2):
  phylib: Use common page register definition for Marvell PHYs.
  phylib: Add support for Marvell 88E1149R devices.

 drivers/net/phy/marvell.c   |   67 ++++++++++++++++++++++++++++++++++--------
 include/linux/marvell_phy.h |    1 +
 2 files changed, 55 insertions(+), 13 deletions(-)

Cc: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
Cc: Arnaud Patard <arnaud.patard-dQbF7i+pzddAfugRpC6u6w@public.gmane.org>
Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
Cc: Wolfram Sang <w.sang-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
-- 
1.7.2.3

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 21:58 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20101119.135213.15239226.davem@davemloft.net>

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
> 
>>I suppose then one would be able to track the consumer pointer (on tx)
>>to "know" that certain data had been ACKed by the remote?  For TCP
>>anyway - and assuming there wouldn't be a case where TCP might copy
>>the data out of the ring and assert "completion."
> 
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.

But does one really want to lock-in that the update to the consumer pointer 
means the data has been ACKed by the remote (or I suppose that DMA have 
completed if it were UDP)?  We can think of no case where the stack will want to 
copy out of the ring and assert completion to the user before it got ACKed by 
the remote?  Say when the stack wants to autotune the send socket buffer size to 
something larger than the tx ring?

rick

^ permalink raw reply

* [70/82] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: Greg KH @ 2010-11-19 21:56 UTC (permalink / raw)
  To: linux-kernel, stable, David S. Miller
  Cc: stable-review, torvalds, akpm, alan, Robin Holt, Willy Tarreau,
	netdev, linux-sctp, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Vlad Yasevich,
	Sridhar Samudrala

2.6.35-stable review patch.  If anyone has any objections, please let us know.

------------------


On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
sysctl_sctp_mem[2] can integer overflow.  Set limit such that they are
maximized without overflowing.

Signed-off-by: Robin Holt <holt@sgi.com>
To: "David S. Miller" <davem@davemloft.net>
Cc: Willy Tarreau <w@1wt.eu>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-sctp@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 net/ipv4/tcp.c      |    4 +++-
 net/ipv4/udp.c      |    4 +++-
 net/sctp/protocol.c |    4 +++-
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3252,12 +3252,14 @@ void __init tcp_init(void)
 
 	/* Set the pressure threshold to be a fraction of global memory that
 	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
-	 * memory, with a floor of 128 pages.
+	 * memory, with a floor of 128 pages, and a ceiling that prevents an
+	 * integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_tcp_mem[0] = limit / 4 * 3;
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2167,12 +2167,14 @@ void __init udp_init(void)
 	udp_table_init(&udp_table, "UDP");
 	/* Set the pressure threshold up by the same strategy of TCP. It is a
 	 * fraction of global memory that is up to 1/2 at 256 MB, decreasing
-	 * toward zero with the amount of memory, with a floor of 128 pages.
+	 * toward zero with the amount of memory, with a floor of 128 pages,
+	 * and a ceiling that prevents an integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_udp_mem[0] = limit / 4 * 3;
 	sysctl_udp_mem[1] = limit;
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1161,7 +1161,8 @@ SCTP_STATIC __init int sctp_init(void)
 
 	/* Set the pressure threshold to be a fraction of global memory that
 	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
-	 * memory, with a floor of 128 pages.
+	 * memory, with a floor of 128 pages, and a ceiling that prevents an
+	 * integer overflow.
 	 * Note this initalizes the data in sctpv6_prot too
 	 * Unabashedly stolen from tcp_init
 	 */
@@ -1169,6 +1170,7 @@ SCTP_STATIC __init int sctp_init(void)
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_sctp_mem[0] = limit / 4 * 3;
 	sysctl_sctp_mem[1] = limit;
 	sysctl_sctp_mem[2] = sysctl_sctp_mem[0] * 2;

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: Tom Herbert @ 2010-11-19 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev
In-Reply-To: <20101119.135213.15239226.davem@davemloft.net>

On Fri, Nov 19, 2010 at 1:52 PM, David Miller <davem@davemloft.net> wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
>
>> I suppose then one would be able to track the consumer pointer (on tx)
>> to "know" that certain data had been ACKed by the remote?  For TCP
>> anyway - and assuming there wouldn't be a case where TCP might copy
>> the data out of the ring and assert "completion."
>
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.
>

And, for stream sockets the ring would be one big contiguous buffer,
for datagram would be packetized buffer like with packet interface.

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: David Miller @ 2010-11-19 21:52 UTC (permalink / raw)
  To: rick.jones2; +Cc: therbert, netdev
In-Reply-To: <4CE6ED09.70602@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:32:57 -0800

> I suppose then one would be able to track the consumer pointer (on tx)
> to "know" that certain data had been ACKed by the remote?  For TCP
> anyway - and assuming there wouldn't be a case where TCP might copy
> the data out of the ring and assert "completion."

Yes, that's implicit in his design, the kernel manages the consumer
pointer in the ring and this is how userspace can see when ring entries
are reusable.

^ permalink raw reply

* Re: [PATCH net-next-2.6 16/17] can: EG20T PCH: Fix incorrect return processing
From: Marc Kleine-Budde @ 2010-11-19 21:44 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Samuel Ortiz,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, Wolfgang Grandegger,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CE6104C.3090401-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2811 bytes --]

On 11/19/2010 06:51 AM, Tomoya MORINAGA wrote:
> Fix incorrect return processing

The description is correct. But you change several things that have
nothing to do. The frist hunk fixes shared irq handlers, the others the
quota calculation in the napi handler.

Marc

> 
> Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
> ---
>  drivers/net/can/pch_can.c |   20 ++++++++++++--------
>  1 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> index c612a99..48f4a2e 100644
> --- a/drivers/net/can/pch_can.c
> +++ b/drivers/net/can/pch_can.c
> @@ -589,10 +589,12 @@ static irqreturn_t pch_can_interrupt(int irq, void *dev_id)
>  	struct net_device *ndev = (struct net_device *)dev_id;
>  	struct pch_can_priv *priv = netdev_priv(ndev);
>  
> -	pch_can_set_int_enables(priv, PCH_CAN_NONE);
> -	napi_schedule(&priv->napi);
> -
> -	return IRQ_HANDLED;
> +	if ((pch_can_int_pending(priv) > 0) && (dev_id != NULL)) {

dev_id is always != NULL, because you registered your IRQ handler with
it. (BTW: dev_id has already been dereferenced in netdev_priv(), so if
this code is executed, dev_if is != NULL)

Just write:

if (!pch_can_int_pending(priv))
	return IRQ_NONE;

> +		pch_can_set_int_enables(priv, PCH_CAN_NONE);
> +		napi_schedule(&priv->napi);
> +		return IRQ_HANDLED;
> +	}
> +	return IRQ_NONE;
>  }
>  
>  static void pch_fifo_thresh(struct pch_can_priv *priv, int obj_id)
> @@ -674,7 +676,7 @@ static int pch_can_rx_normal(struct net_device *ndev, u32 obj_num, int quota)
>  		if (reg & PCH_IF_MCONT_MSGLOST) {
>  			rtn = pch_can_rx_msg_lost(ndev, obj_num);
>  			if (!rtn)
> -				return rtn;
> +				return rcv_pkts;
>  			rcv_pkts++;
>  			quota--;
>  			obj_num++;
> @@ -777,10 +779,12 @@ static int pch_can_poll(struct napi_struct *napi, int quota)
>  		goto end;
>  
>  	if ((int_stat >= PCH_RX_OBJ_START) && (int_stat <= PCH_RX_OBJ_END)) {
> -		rcv_pkts += pch_can_rx_normal(ndev, int_stat, quota);
> -		quota -= rcv_pkts;
> -		if (quota < 0)
> +		rcv_pkts = pch_can_rx_normal(ndev, int_stat, quota);

maybe it's better to rx as much packages in rx_normal as possible and
not return if you failed to alloc a can_frame.

> +		if (rcv_pkts < 0) {
> +			rcv_pkts = 0;
>  			goto end;
> +		}
> +		quota -= rcv_pkts;
>  	} else if ((int_stat >= PCH_TX_OBJ_START) &&
>  		   (int_stat <= PCH_TX_OBJ_END)) {
>  		/* Handle transmission interrupt */


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 21:32 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List
In-Reply-To: <AANLkTi=WF2hAGAufv_Anc=b=Fm2WOpOMOv1UrDRvaTHp@mail.gmail.com>

I suppose then one would be able to track the consumer pointer (on tx) to "know" 
that certain data had been ACKed by the remote?  For TCP anyway - and assuming 
there wouldn't be a case where TCP might copy the data out of the ring and 
assert "completion."

How would the mmap'ing interact with autotuning?  Particularly in the "future 
case" of HW support for true zero copy on receive.

Today I can be assured that data I receieve is on a "nice" boundary in my memory 
by virtue of the buffer pointer I pass to the recv() call, but with the "rings" 
(they are simply some chunk of virtually continguous data yes?) it will just be 
bumped up against the end of the last one - I'm wondering if that might not be a 
problem, especially for UDP datagrams, but even for TCP data?  It is one thing 
to have people pad structures in the memory of a system, but telling them to 
pad-out the messages they send across the network to maintain alignment is 
rather different.

How do you differentiate between a "there is data to send" and "I want to send a 
zero-legnth datagram for a UDP socket?  I think you will have to do/overload 
something other than a send() call for the trigger.

rick jones

^ permalink raw reply

* Re: [RFC][PATCH 0/5] RFS hardware acceleration (v2)
From: Rick Jones @ 2010-11-19 21:16 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, Tom Herbert, netdev, linux-net-drivers
In-Reply-To: <1290194386.2671.59.camel@bwh-desktop>

Ben Hutchings wrote:
> So, some preliminary benchmark results.
> 
> Tom said he was using 200 concurrent netperf TCP_RR tests, so I've done
> the same, using netperf 2.4.1 (a bit out of date, I know).

Not a huge deal for a basic TCP_RR test though.

<data snipped>

> So accelerated RFS gave a 6-13% improvement over software RFS in
> transaction rate for these various cases.

Do you have any data on frequency with which unpinned netperf processes migrated 
from one core to another?  Or PCIe utilization?

rick jones

^ permalink raw reply

* Re: linux-next: manual merge of the net tree with the net-current tree
From: David Miller @ 2010-11-19 21:14 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, eric.dumazet
In-Reply-To: <20101119111706.d68efe13.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 19 Nov 2010 11:17:06 +1100

> Hi all,
> 
> Today's linux-next merge of the net tree got a conflict in
> drivers/net/bonding/bond_main.c between commit
> 3006bc38895895f1a0352c2e17e1a503f35f7e2f ("bonding: fix a race in IGMP
> handling") from the net-current tree and commit
> 866f3b25a2eb60d7529c227a0ecd80c3aba443fd ("bonding: IGMP handling
> cleanup") from the net tree.
> 
> I just assumed that the latter is a better solution and used that.

Yep, that's correct, and I'm doing the same as I merge things myself
right now.

Thanks!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox