Netdev List
* RE: Launch Time Support
From: Vick, Matthew @ 2012-12-17 21:44 UTC (permalink / raw)
  To: Ulf samuelsson; +Cc: netdev@vger.kernel.org
In-Reply-To: <EC58F455-4B86-48F7-95B7-A15C6FC98024@emagii.com>

> -----Original Message-----
> From: Ulf samuelsson [mailto:netdev@emagii.com]
> Sent: Friday, December 14, 2012 11:35 PM
> To: Vick, Matthew
> Cc: netdev@vger.kernel.org
> Subject: Re: Launch Time Support
> 
> 
> On 15 Dec 2012, at 01:45, "Vick, Matthew" <matthew.vick@intel.com> wrote:
> 
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of Ulf Samuelsson
> >> Sent: Wednesday, December 12, 2012 5:04 PM
> >> To: netdev@vger.kernel.org
> >> Subject: RFC: Launch Time Support
> >>
> >> Hi, I am looking for some feedback on how to implement launchtime in
> >> the kernel.
> >>
> >> I.e., you define WHEN you want to send a packet, and the driver will
> >> store the packet in a buffer and send it out on the net when the
> >> internal timestamp counter in the network controller reaches the
> >> specified "launch time".
> >>
> >> Some Ethernet controllers, like the new Intel i210, support "launch
> >> time".
> >>
> >> Support for launch time is desirable for any isochronous connection,
> >> but I am currently interested in the NTP protocol to improve the
> >> timing.
> >>
> >> Proposed Changes to the Kernel
> >> ===========================================================
> >> The launchtime support will be dependent on CONFIG_NET_LAUNCHTIME. If
> >> this is not set, then the kernel functionality is not changed.
> >>
> >> My current idea is to add a new bit to the "flags" field of
> >> "socket.c:sendto"
> >> #define MSG_LAUNCHTIME 0x?????
> >>
> >> struct msghdr gets an additional launchtime field.
> >>
> >> sendto will check if the flags parameter contains MSG_LAUNCHTIME.
> >> If it does, then the first 64-bit longword of the packet (buff)
> >> contains the launchtime.
> >> The launchtime from the buffer is copied to the msghdr.launchtime
> >> field, and the first 64 bits of the packet are then shaved off
> >> before the address is written to the msghdr.
> >>
> >> Each network controller supporting launchtime needs to have an
> >> alternative call to "send packet with launchtime". This call adds
> >> the launchtime parameter.
> >> If launchtime is supported, the exported "ops" includes the new call.
> >>
> >> The UDP/IP packet send will check MSG_LAUNCHTIME and, if it is set,
> >> will check whether the "send packet with launchtime" call is available
> >> for the driver. If so, it calls it; otherwise it calls the normal send
> >> packet and thus ignores the launchtime.
> >>
> >> Before launchtime is used, the application should send an ioctl to
> >> the driver, making sure that launchtime is configured, and only if
> >> the driver ACKs will the application use launchtime.
> >>
> >> (Possibly the "ops" field for "send packet with launchtime" should be
> >> NULL until that ioctl is complete. Comments?)
> >>
> >> To me, this seems to be transparent for all other network stacks so
> >> protocols and drivers not supporting launchtime should still work.
> >>
> >> As far as I know, no driver supports launch time today.
> >> The Intel igb driver does not, at least in the latest version on the
> >> Intel web site. The headers in that version define the registers,
> >> but so far the code is not using them.
> >>
> >> There is the linux_igb_avb project on SourceForge, which allows use
> >> of launch time for user space applications, but not as part of the
> >> kernel.
> >>
> >> Maybe there is more work done somewhere else, but I am not aware of
> >> it, so any links to such work are appreciated.
> >>
> >> There are some FPGA-based PCIe boards that support launchtime (Endace
> >> DAG) using proprietary APIs.
> >> I talked to some vendors providing TCP/IP offload engines for FPGA;
> >> they do not support launchtime and, like Endace, use proprietary APIs,
> >> so they are only usable by custom programs. Normal networking
> >> interfaces are not supported.
> >>
> >> Comments on the above are appreciated.
> >>
> >> BACKGROUND
> >> For those that do not know how the NTP protocol works:
> >> ===================================================
> >> The client sends a UDP packet to the NTP server using port 123. The
> >> NTP client reads the current systime and puts that in the outgoing
> >> packet.
> >> There is a delay between the time the systime is read and the time
> >> the packet actually leaves the Ethernet controller, adding jitter to
> >> the NTP algorithm.
> >>
> >> When the server receives the packet, it can be timestamped in H/W, and
> >> a CMSG is then created by the network stack containing that timestamp
> >> for use by the server NTP daemon.
> >>
> >> The server generates a reply, which needs to include the client
> >> transmit time, the server's receive time, and the server's transmit
> >> time.
> >> Again, the transmit time needs to be written into the NTP packet, and
> >> the packet then needs to be processed through the network stack before
> >> it leaves the Ethernet controller, causing more jitter.
> >>
> >> If launch time is supported, then the client NTP daemon would simply
> >> read the systime and add a constant delay to create the transmit
> >> timestamp.
> >> The delay needs to be sufficiently large to ensure that all
> >> processing is done.
> >>
> >> The server will do something similar, adding a constant to the server
> >> receive timestamp to create the server transmit timestamp.
> >> If both the client and the server use H/W timestamping and launch
> >> time, then the jitter is ideally reduced to zero.
> >>
> >> TRANSMIT TIMESTAMPING
> >> ========================
> >> Support for TX timestamps in H/W is not really useful, since you need
> >> to provide the TX timestamp in the packet you measure on, so when you
> >> know the timestamp it is too late. Server-to-server NTP connections
> >> support sending that timestamp in a new packet, but there is no such
> >> support in client-server communication.
> >>
> >> The i210 supports putting the timestamp inside the packet as it
> >> leaves the Ethernet controller, but that means that you screw up the
> >> UDP checksum, so the packet will be rejected by the receiving NTP
> >> daemon.
> >> In addition, the i210 timestamp measures seconds and nanoseconds,
> >> which is incompatible with the NTP timestamp, which uses seconds and a
> >> 32-bit fraction of a second, so that does not work either.
> >>
> >> Best Regards
> >> Ulf Samuelsson
> >> eMagii.
> >
> > Ulf,
> >
> > I have been looking into adding launch time support as part of
> > enabling some of the I210 functionality you have described (such as in
> > linux_igb_avb on SourceForge) upstream--less focused on NTP and more
> > focused on AVB, but launch time will be necessary for both. If you
> > would like, please feel free to contact me and I would love to work
> > with you on this.
> >
> > Reading your proposal, I'm a little confused by which systime you're
> > referring to. Do you mean on the host or on the NIC? In the case of
> > hardware timestamping today, in igb we set the SYSTIM registers to the
> > current system time, but that doesn't mean that the host clock and the
> > NIC clock stay synced. You would either need a mechanism such as PPS
> > (which igb does not implement today) to sync the host clock to the NIC
> > clock or have the NTP daemon account for the discrepancy. Off the top
> > of my head, I want to say modern PTP daemons (such as ptp4l) account
> > for the discrepancy in the daemon.
> >
> > Cheers,
> > Matthew
> 
> We live in luxury, having access to a Cesium Clock ;-) and we define
> the time, being a top-level (Stratum 1) server.
> 
> There are some I/Os on the i210 that can be used to interface to the
> PPS.
> 
> As for reading systime, it is done indirectly: you get the systime as
> part of the incoming NTP packet (it is timestamped at reception) and
> add the constant to that value.
> 
> Best Regards
> Ulf Samuelsson

So your proposal is to use a PPS interface (from some Stratum 1 server) to drive the clock on an I210 so you can use the I210's launch time mechanism to send packets at a certain time--is this correct? Or are you talking more about a software launch time solution?

Forgive my ignorance, but what constant are you referring to in the receive path? In your first e-mail, you mention that the constant should be added in the transmit path.

Also, how will you account for hardware discrepancies? For example, how far in the future you can schedule a packet will differ from hardware to hardware.

Cheers,
Matthew
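
On the format mismatch raised in the quoted RFC (seconds plus nanoseconds on
the i210 versus seconds plus a 32-bit binary fraction in NTP): this does not
help with in-hardware timestamp insertion, but a daemon deriving a transmit
timestamp from a launch time would need a conversion along these lines. A
minimal sketch only; the function names are made up for illustration, and it
assumes the hardware clock counts Unix-epoch seconds, which the thread does
not confirm.

```c
#include <stdint.h>

/* Seconds between the NTP era-0 epoch (1900) and the Unix epoch (1970). */
#define NTP_UNIX_OFFSET 2208988800ULL
#define NSEC_PER_SEC    1000000000ULL

/* Hypothetical helper: scale nanoseconds (0..999999999) to a 32-bit
 * binary fraction of a second, i.e. nsec * 2^32 / 10^9, rounded down. */
static uint32_t nsec_to_ntp_frac(uint32_t nsec)
{
    return (uint32_t)(((uint64_t)nsec << 32) / NSEC_PER_SEC);
}

/* Hypothetical helper: pack Unix seconds + nanoseconds into a 64-bit NTP
 * timestamp (upper 32 bits seconds, lower 32 bits fraction).
 * Assumption: the input seconds are relative to the Unix epoch. */
static uint64_t unix_to_ntp(uint64_t unix_sec, uint32_t nsec)
{
    uint64_t ntp_sec = (unix_sec + NTP_UNIX_OFFSET) & 0xffffffffULL;

    return (ntp_sec << 32) | nsec_to_ntp_frac(nsec);
}
```

For example, half a second (500000000 ns) maps to a fraction of 0x80000000.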


* Re: Whence a description of how to enable TCP FASTOPEN in a net-next kernel?
From: Eric Dumazet @ 2012-12-17 21:56 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev
In-Reply-To: <50CF8E2A.5020201@hp.com>

On Mon, 2012-12-17 at 13:27 -0800, Rick Jones wrote:
> Is there a writeup describing the steps needed to enable TCP_FASTOPEN in 
> a net-next kernel? (pulled earlier today)
> 
> I am looking to debug netperf's support for enabling the feature and I 
> want to make sure I've enabled things correctly in the kernel.  Thus far 
> I've set the tcp_fastopen sysctl to one, and I see the "client" side of 
> netperf making the appropriate sendto() call, and I see what appears to 
> be the correct setsockopt being set on the server side, but my tcpdump 
> traces of the traffic flowing over loopback in my test setup, while 
> showing the client including the experimental option, do not show the 
> server side responding:
> 
> 13:10:23.870202 IP localhost.5923 > localhost.54363: Flags [S], seq 
> 935361110, win 43690, options [mss 65495,sackOK,TS val 889762 ecr 
> 0,nop,wscale 7,Unknown Option 254f989], length 0
> 13:10:23.870214 IP localhost.54363 > localhost.5923: Flags [S.], seq 
> 4210640362, ack 935361111, win 43690, options [mss 65495,sackOK,TS val 
> 889762 ecr 889762,nop,wscale 7], length 0
> 
> The netserver side strace snippet:
> 
> 3861  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 8
> 3861  getsockopt(8, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
> 3861  getsockopt(8, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
> 3861  setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 3861  bind(8, {sa_family=AF_INET, sin_port=htons(0), 
> sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> 3861  setsockopt(8, SOL_TCP, 0x17 /* TCP_??? */, [5], 4) = 0
> 3861  listen(8, 5)                      = 0
> ...
> 3861  accept(8, {sa_family=AF_INET, sin_port=htons(5923), 
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 9
> 3861  recvfrom(9, "n", 1, 0, NULL, NULL) = 1
> 3861  sendto(9, "n", 1, 0, NULL, 0)     = 1
> 3861  getsockopt(9, SOL_SOCKET, SO_RCVBUF, [262030], [4]) = 0
> 3861  getsockopt(9, SOL_SOCKET, SO_SNDBUF, [663750], [4]) = 0
> 3861  close(9)                          = 0

> lather, rinse, repeat the accept sequence off that listen endpoint.
> 
> happy benchmarking,

I guess you need to enable fastopen for both client and server:

echo 3 >/proc/sys/net/ipv4/tcp_fastopen


* Re: Whence a description of how to enable TCP FASTOPEN in a net-next kernel?
From: Eric Dumazet @ 2012-12-17 22:03 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev
In-Reply-To: <1355781411.9380.12.camel@edumazet-glaptop>

On Mon, 2012-12-17 at 13:56 -0800, Eric Dumazet wrote:

> 
> I guess you need to enable fastopen for both client and server:
>
> echo 3 >/proc/sys/net/ipv4/tcp_fastopen

vi +475 Documentation/networking/ip-sysctl.txt

tcp_fastopen - INTEGER
        Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data
        in the opening SYN packet. To use this feature, the client application
        must use sendmsg() or sendto() with MSG_FASTOPEN flag rather than
        connect() to perform a TCP handshake automatically.

        The values (bitmap) are
        1: Enables sending data in the opening SYN on the client.
        2: Enables TCP Fast Open on the server side, i.e., allowing data in
           a SYN packet to be accepted and passed to the application before
           3-way hand shake finishes.
        4: Send data in the opening SYN regardless of cookie availability and
           without a cookie option.
        0x100: Accept SYN data w/o validating the cookie.
        0x200: Accept data-in-SYN w/o any cookie option present.
        0x400/0x800: Enable Fast Open on all listeners regardless of the
           TCP_FASTOPEN socket option. The two different flags designate two
           different ways of setting max_qlen without the TCP_FASTOPEN socket
           option.

        Default: 0

        Note that the client & server side Fast Open flags (1 and 2
        respectively) must be also enabled before the rest of flags can take
        effect.

        See include/net/tcp.h and the code for more details.
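
For the loopback test, the two low bits are what matter: a value of 1
enables only the client side, while 3 enables both. A small sketch of
decoding the bitmap follows. The macro and function names are invented for
illustration and are not claimed to match include/net/tcp.h; only the bit
values come from the text above.

```c
#include <stdbool.h>

/* Bit values as listed in the ip-sysctl.txt excerpt; names are hypothetical. */
#define EX_TFO_CLIENT 0x1u  /* send data in the opening SYN (client)  */
#define EX_TFO_SERVER 0x2u  /* accept data in the SYN (server)        */

/* True if the sysctl value enables the client side. */
static bool ex_tfo_client_enabled(unsigned int sysctl_val)
{
    return (sysctl_val & EX_TFO_CLIENT) != 0;
}

/* True if the sysctl value enables the server side. */
static bool ex_tfo_server_enabled(unsigned int sysctl_val)
{
    return (sysctl_val & EX_TFO_SERVER) != 0;
}

/* A single-host (loopback) test needs both sides enabled. */
static bool ex_tfo_loopback_ready(unsigned int sysctl_val)
{
    return ex_tfo_client_enabled(sysctl_val) &&
           ex_tfo_server_enabled(sysctl_val);
}
```

With the value 1 only the client check passes, which matches the trace
where the SYN carried the option but the SYN-ACK did not.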


* Re: [GIT PULL net-next] NDISC Updates (sender-side clean-up)
From: David Miller @ 2012-12-17 22:31 UTC (permalink / raw)
  To: yoshfuji; +Cc: netdev
In-Reply-To: <50CF84A5.7030706@linux-ipv6.org>


Sorry, you cannot just send a pull request without posting
the patches as well for people to review.

I'm not pulling from your tree without any posting of the
patches for review.


* Re: Whence a description of how to enable TCP FASTOPEN in a net-next kernel?
From: Rick Jones @ 2012-12-17 22:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1355781411.9380.12.camel@edumazet-glaptop>

On 12/17/2012 01:56 PM, Eric Dumazet wrote:
> On Mon, 2012-12-17 at 13:27 -0800, Rick Jones wrote:
>> [an explanation of what he'd done that hadn't worked]
>
> I guess you need to enable fastopen for both client and server:
>
> echo 3 >/proc/sys/net/ipv4/tcp_fastopen


Looks like I'm good now:

raj@tardy-ubuntu-1204:~$ cat /proc/sys/net/ipv4/tcp_fastopen
3
raj@tardy-ubuntu-1204:~$ sudo tcpdump -c 30 -i lo 'not port 12865'
[sudo] password for raj:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
14:28:43.536166 IP localhost.29105 > localhost.srvr: Flags [S], seq 
378007190, win 43690, options [mss 65495,sackOK,TS val 4294942097 ecr 
0,nop,wscale 7,Unknown Option 254f989], length 0
14:28:43.536191 IP localhost.srvr > localhost.29105: Flags [S.], seq 
2030806688, ack 378007191, win 43690, options [mss 65495,sackOK,TS val 
4294942097 ecr 4294942097,nop,wscale 7,Unknown Option 
254f989e73dc061f14d850e], length 0
14:28:43.537421 IP localhost.29105 > localhost.srvr: Flags [P.], seq 
1:2, ack 1, win 342, options [nop,nop,TS val 4294942098 ecr 4294942097], 
length 1
14:28:43.537445 IP localhost.srvr > localhost.29105: Flags [.], ack 2, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.537525 IP localhost.srvr > localhost.29105: Flags [P.], seq 
1:2, ack 2, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], 
length 1
14:28:43.537542 IP localhost.srvr > localhost.29105: Flags [F.], seq 2, 
ack 2, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.537727 IP localhost.29105 > localhost.srvr: Flags [F.], seq 2, 
ack 3, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.537741 IP localhost.srvr > localhost.29105: Flags [.], ack 3, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.537895 IP localhost.29106 > localhost.srvr: Flags [S], seq 
1735077945:1735077946, win 43690, options [mss 65495,sackOK,TS val 
4294942098 ecr 0,nop,wscale 7,Unknown Option 254f989e73dc061f14d850e], 
length 1
14:28:43.537909 IP localhost.srvr > localhost.29106: Flags [S.], seq 
1983728126, ack 1735077947, win 43690, options [mss 65495,sackOK,TS val 
4294942098 ecr 4294942098,nop,wscale 7], length 0
14:28:43.537924 IP localhost.29106 > localhost.srvr: Flags [.], ack 1, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538014 IP localhost.srvr > localhost.29106: Flags [P.], seq 
1:2, ack 1, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], 
length 1
14:28:43.538028 IP localhost.srvr > localhost.29106: Flags [F.], seq 2, 
ack 1, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538196 IP localhost.29106 > localhost.srvr: Flags [.], ack 2, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538288 IP localhost.29106 > localhost.srvr: Flags [F.], seq 1, 
ack 3, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538300 IP localhost.srvr > localhost.29106: Flags [.], ack 2, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538417 IP localhost.29107 > localhost.srvr: Flags [S], seq 
3902541042:3902541043, win 43690, options [mss 65495,sackOK,TS val 
4294942098 ecr 0,nop,wscale 7,Unknown Option 254f989e73dc061f14d850e], 
length 1
14:28:43.538431 IP localhost.srvr > localhost.29107: Flags [S.], seq 
941945820, ack 3902541044, win 43690, options [mss 65495,sackOK,TS val 
4294942098 ecr 4294942098,nop,wscale 7], length 0
14:28:43.538445 IP localhost.29107 > localhost.srvr: Flags [.], ack 1, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538518 IP localhost.srvr > localhost.29107: Flags [P.], seq 
1:2, ack 1, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], 
length 1
14:28:43.538531 IP localhost.srvr > localhost.29107: Flags [F.], seq 2, 
ack 1, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538654 IP localhost.29107 > localhost.srvr: Flags [.], ack 2, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538730 IP localhost.29107 > localhost.srvr: Flags [F.], seq 1, 
ack 3, win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0
14:28:43.538742 IP localhost.srvr > localhost.29107: Flags [.], ack 2, 
win 342, options [nop,nop,TS val 4294942098 ecr 4294942098], length 0

in which case I suppose that means that netperf top-of-trunk does indeed
have client and server side support for TCP_FASTOPEN.  It is enabled via
the test-specific -F option, though in the loopback test (in a 1 VCPU VM)
I don't see much of a difference (don't suppose I should really):

raj@tardy-ubuntu-1204:~/netperf2_trunk/src$ ./netperf -t TCP_CRR -l 30 
-i 30,3 -- -F -P ,12345
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 
AF_INET to localhost () port 12345 AF_INET : +/-2.500% @ 99% conf.  : demo
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       30.00    15909.07
16384  87380
raj@tardy-ubuntu-1204:~/netperf2_trunk/src$ ./netperf -t TCP_CRR -l 30 
-i 30,3 --  -P ,12345
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 
AF_INET to localhost () port 12345 AF_INET : +/-2.500% @ 99% conf.  : demo
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       30.00    15574.37
16384  87380

happy benchmarking,

rick jones
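
To put a number on "not much of a difference": the two TCP_CRR rates above
differ by a little over 2%, which is smaller than the +/-2.5% interval
netperf printed for each run. That suggests, but does not prove, that the
gap is within measurement noise. A minimal sketch of the arithmetic, with
helper names chosen only for this example:

```c
/* Relative difference of `a` against baseline `b`, in percent. */
static double rel_diff_percent(double a, double b)
{
    return (a - b) / b * 100.0;
}

/* True if the magnitude of `percent` is no larger than `half_width`,
 * the +/- figure reported alongside the result. */
static int within_half_width(double percent, double half_width)
{
    if (percent < 0.0)
        percent = -percent;
    return percent <= half_width;
}
```

Calling rel_diff_percent(15909.07, 15574.37) gives the gap between the run
with -F and the run without it.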


* Re: Launch Time Support
From: Ulf samuelsson @ 2012-12-17 22:57 UTC (permalink / raw)
  To: Vick, Matthew, netdev@vger.kernel.org
In-Reply-To: <06DFBC1E25D8024DB214DC7F41A3CD344897485E@ORSMSX101.amr.corp.intel.com>



On 17 Dec 2012, at 22:44, "Vick, Matthew" <matthew.vick@intel.com> wrote:

> 
> So your proposal is to use a PPS interface (from some Stratum 1 server) to drive the clock on an I210 so you can use the I210's launch time mechanism to send packets at a certain time--is this correct? Or are you talking more about a software launch time solution?

There are 4 I/O pins, and you can capture the timestamp when they change value.
This can be used to calibrate the timestamp vs the 1 PPS clock.
It is far from ideal, but it might just work.

If you can clock the part from a 5 MHz clock from the Cesium clock, that would be even better.




> 
> Forgive my ignorance, but what constant are you referring to in the receive path? Based on your first e-mail, you mention the constant should be added to the transmit path.

It is an arbitrary time constant you select in the range of tens of milliseconds.
The only requirement is that the time from receiving the incoming packet,
to the time when the reply leaves the machine is always shorter than the time constant.
We have seen that this has been implemented on a 10 GbE FPGA card using 50 ms.
There is no requirement for short latency in NTP, only predictable latency.

> 
> Also, how will you account for hardware discrepancies? For example, how far in the future you can schedule a packet will differ from hardware to hardware.

We don't have that problem, since it is for internal use and we will use the same H/W
in all nodes.  For your use, you would possibly want to be able to query the driver.


Best Regards
Ulf Samuelsson
ulf@emagii.com
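
The turnaround rule described above (launch time = hardware receive
timestamp plus a fixed constant, with the reply required to be ready before
that instant) can be sketched as follows. The names are hypothetical,
timestamps are assumed to be nanoseconds on the NIC clock, and 50 ms is
just the figure mentioned for the FPGA card, not a recommendation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical helper: launch time is the receive timestamp plus a fixed
 * turnaround delay. Both values are on the same (NIC) clock, in ns. */
static uint64_t launch_time_ns(uint64_t rx_timestamp_ns, uint32_t delay_ms)
{
    return rx_timestamp_ns + (uint64_t)delay_ms * NSEC_PER_MSEC;
}

/* The scheme only holds if the reply has been handed to the hardware
 * before its launch time; otherwise the constant was chosen too small. */
static bool reply_ready_in_time(uint64_t ready_ns, uint64_t launch_ns)
{
    return ready_ns < launch_ns;
}
```

The server would also write the value from launch_time_ns() into the reply
as its transmit timestamp, since that is when the packet is expected to
leave the wire.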




* [PATCH] tuntap: fix sparse warning
From: Jason Wang @ 2012-12-18  3:00 UTC (permalink / raw)
  To: davem, netdev, linux-kernel; +Cc: fengguang.wu, Jason Wang

Make tun_enable_queue() static to fix the sparse warning:

drivers/net/tun.c:399:19: sparse: symbol 'tun_enable_queue' was not declared. Should it be static?

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 173acf5..504f7f1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -396,7 +396,7 @@ static void tun_disable_queue(struct tun_struct *tun, struct tun_file *tfile)
 	++tun->numdisabled;
 }
 
-struct tun_struct *tun_enable_queue(struct tun_file *tfile)
+static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
 {
 	struct tun_struct *tun = tfile->detached;
 
-- 
1.7.1


* [PATCH v4] netfilter: nf_conntrack_sip: Handle Cisco 7941/7945 IP phones
From: Kevin Cernekee @ 2012-12-18  4:33 UTC (permalink / raw)
  To: pablo
  Cc: David Woodhouse, Eric Dumazet, Patrick McHardy, David S. Miller,
	Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Gabor Juhos, netfilter-devel, netfilter,
	coreteam, linux-kernel, netdev

Most SIP devices use a source port of 5060/udp on SIP requests, so the
response automatically comes back to port 5060:

    phone_ip:5060 -> proxy_ip:5060   REGISTER
    proxy_ip:5060 -> phone_ip:5060   100 Trying

The newer Cisco IP phones, however, use a randomly chosen high source
port for the SIP request but expect the response on port 5060:

    phone_ip:49173 -> proxy_ip:5060  REGISTER
    proxy_ip:5060 -> phone_ip:5060   100 Trying

Standard Linux NAT, with or without nf_nat_sip, will send the reply back
to port 49173, not 5060:

    phone_ip:49173 -> proxy_ip:5060  REGISTER
    proxy_ip:5060 -> phone_ip:49173  100 Trying

But the phone is not listening on 49173, so it will never see the reply.

This patch modifies nf_*_sip to work around this quirk by extracting
the SIP response port from the Via: header, iff the source IP in the
packet header matches the source IP in the SIP request.

Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
---


Baseline: git://1984.lsi.us.es/nf-next

v3->v4 changes:

Fix patch context and APIs to match the current Linux tree.  These
changes are from OpenWRT (Gabor?) and David W.

v4 was tested with Cisco 7945 (high UDP destination port) and Snom m9
(normal "symmetric" UDP destination port), both on IPv4 only.

I've been running a recent OpenWRT port of this patch (Attitude Adjustment
release, 3.3 kernel) for ~2mo, with both phones as clients.


 include/linux/netfilter/nf_conntrack_sip.h |    3 +++
 net/netfilter/nf_conntrack_sip.c           |   17 +++++++++++++++++
 net/netfilter/nf_nat_sip.c                 |   27 ++++++++++++++++++++++++---
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_sip.h b/include/linux/netfilter/nf_conntrack_sip.h
index 387bdd0..ba7f571 100644
--- a/include/linux/netfilter/nf_conntrack_sip.h
+++ b/include/linux/netfilter/nf_conntrack_sip.h
@@ -4,12 +4,15 @@
 
 #include <net/netfilter/nf_conntrack_expect.h>
 
+#include <linux/types.h>
+
 #define SIP_PORT	5060
 #define SIP_TIMEOUT	3600
 
 struct nf_ct_sip_master {
 	unsigned int	register_cseq;
 	unsigned int	invite_cseq;
+	__be16		forced_dport;
 };
 
 enum sip_expectation_classes {
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index df8f4f2..72a67bb 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1440,8 +1440,25 @@ static int process_sip_request(struct sk_buff *skb, unsigned int protoff,
 {
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+	struct nf_ct_sip_master *ct_sip_info = nfct_help_data(ct);
+	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
 	unsigned int matchoff, matchlen;
 	unsigned int cseq, i;
+	union nf_inet_addr addr;
+	__be16 port;
+
+	/* Many Cisco IP phones use a high source port for SIP requests, but
+	 * listen for the response on port 5060.  If we are the local
+	 * router for one of these phones, save the port number from the
+	 * Via: header so that nf_nat_sip can redirect the responses to
+	 * the correct port.
+	 */
+	if (ct_sip_parse_header_uri(ct, *dptr, NULL, *datalen,
+				    SIP_HDR_VIA_UDP, NULL, &matchoff,
+				    &matchlen, &addr, &port) > 0 &&
+	    port != ct->tuplehash[dir].tuple.src.u.udp.port &&
+	    nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.src.u3))
+		ct_sip_info->forced_dport = port;
 
 	for (i = 0; i < ARRAY_SIZE(sip_handlers); i++) {
 		const struct sip_handler *handler;
diff --git a/net/netfilter/nf_nat_sip.c b/net/netfilter/nf_nat_sip.c
index 16303c7..5951146e 100644
--- a/net/netfilter/nf_nat_sip.c
+++ b/net/netfilter/nf_nat_sip.c
@@ -95,6 +95,7 @@ static int map_addr(struct sk_buff *skb, unsigned int protoff,
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
+	struct nf_ct_sip_master *ct_sip_info = nfct_help_data(ct);
 	char buffer[INET6_ADDRSTRLEN + sizeof("[]:nnnnn")];
 	unsigned int buflen;
 	union nf_inet_addr newaddr;
@@ -107,7 +108,8 @@ static int map_addr(struct sk_buff *skb, unsigned int protoff,
 	} else if (nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.dst.u3, addr) &&
 		   ct->tuplehash[dir].tuple.dst.u.udp.port == port) {
 		newaddr = ct->tuplehash[!dir].tuple.src.u3;
-		newport = ct->tuplehash[!dir].tuple.src.u.udp.port;
+		newport = ct_sip_info->forced_dport ? :
+			  ct->tuplehash[!dir].tuple.src.u.udp.port;
 	} else
 		return 1;
 
@@ -144,6 +146,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
+	struct nf_ct_sip_master *ct_sip_info = nfct_help_data(ct);
 	unsigned int coff, matchoff, matchlen;
 	enum sip_header_types hdr;
 	union nf_inet_addr addr;
@@ -258,6 +261,21 @@ next:
 	    !map_sip_addr(skb, protoff, dataoff, dptr, datalen, SIP_HDR_TO))
 		return NF_DROP;
 
+	/* Mangle destination port for Cisco phones, then fix up checksums */
+	if (dir == IP_CT_DIR_REPLY && ct_sip_info->forced_dport) {
+		struct udphdr *uh;
+
+		if (!skb_make_writable(skb, skb->len))
+			return NF_DROP;
+
+		uh = (void *)skb->data + protoff;
+		uh->dest = ct_sip_info->forced_dport;
+
+		if (!nf_nat_mangle_udp_packet(skb, ct, ctinfo, protoff,
+					      0, 0, NULL, 0))
+			return NF_DROP;
+	}
+
 	return NF_ACCEPT;
 }
 
@@ -311,8 +329,10 @@ static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
+	struct nf_ct_sip_master *ct_sip_info = nfct_help_data(ct);
 	union nf_inet_addr newaddr;
 	u_int16_t port;
+	__be16 srcport;
 	char buffer[INET6_ADDRSTRLEN + sizeof("[]:nnnnn")];
 	unsigned int buflen;
 
@@ -326,8 +346,9 @@ static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
 	/* If the signalling port matches the connection's source port in the
 	 * original direction, try to use the destination port in the opposite
 	 * direction. */
-	if (exp->tuple.dst.u.udp.port ==
-	    ct->tuplehash[dir].tuple.src.u.udp.port)
+	srcport = ct_sip_info->forced_dport ? :
+		  ct->tuplehash[dir].tuple.src.u.udp.port;
+	if (exp->tuple.dst.u.udp.port == srcport)
 		port = ntohs(ct->tuplehash[!dir].tuple.dst.u.udp.port);
 	else
 		port = ntohs(exp->tuple.dst.u.udp.port);
-- 
1.7.5.4

* Re: [PATCH] tuntap: fix sparse warning
From: David Miller @ 2012-12-18  4:49 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, linux-kernel, fengguang.wu
In-Reply-To: <1355799627-55529-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 18 Dec 2012 11:00:27 +0800

> Make tun_enable_queue() static to fix the sparse warning:
> 
> drivers/net/tun.c:399:19: sparse: symbol 'tun_enable_queue' was not declared. Should it be static?
> 
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied, thanks.

* Re: [PATCH] netlink: change presentation of portid in procfs to unsigned
From: David Miller @ 2012-12-18  4:51 UTC (permalink / raw)
  To: hannes; +Cc: netdev
In-Reply-To: <20121216010919.GA1528@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Sun, 16 Dec 2012 02:09:19 +0100

> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied.

* Re: [PATCH] netlink: validate addr_len on bind
From: David Miller @ 2012-12-18  4:52 UTC (permalink / raw)
  To: hannes; +Cc: netdev
In-Reply-To: <20121216014219.GB1528@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Sun, 16 Dec 2012 02:42:19 +0100

> Otherwise an out of bounds read could happen.
> 
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied.

* Re: [PATCH] atm: use scnprintf() instead of sprintf()
From: David Miller @ 2012-12-18  4:52 UTC (permalink / raw)
  To: chas; +Cc: netdev, gang.chen
In-Reply-To: <20121217110001.3eaf3ac5@thirdoffive.cmf.nrl.navy.mil>

From: chas williams - CONTRACTOR <chas@cmf.nrl.navy.mil>
Date: Mon, 17 Dec 2012 11:00:01 -0500

> 
> As reported by Chen Gang <gang.chen@asianux.com>, we should ensure there
> is enough space when formatting the sysfs buffers.
> 
> Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>

Applied.

* Re: [PATCH 1/2 v2] qmi_wwan/cdc_ether: add Dell Wireless 5800 (Novatel E362) USB IDs
From: David Miller @ 2012-12-18  4:52 UTC (permalink / raw)
  To: dcbw; +Cc: bjorn, aleksander, netdev
In-Reply-To: <1355768261.1424.50.camel@dcbw.foobar.com>

From: Dan Williams <dcbw@redhat.com>
Date: Mon, 17 Dec 2012 12:17:41 -0600

> Signed-off-by: Dan Williams <dcbw@redhat.com>

Applied.

* Re: [PATCH 2/2 v2] cdc_ether: cleanup: use USB_DEVICE_AND_INTERFACE_INFO for Novatel 551/E362
From: David Miller @ 2012-12-18  4:52 UTC (permalink / raw)
  To: dcbw; +Cc: bjorn, aleksander, netdev
In-Reply-To: <1355768386.1424.52.camel@dcbw.foobar.com>

From: Dan Williams <dcbw@redhat.com>
Date: Mon, 17 Dec 2012 12:19:46 -0600

> Signed-off-by: Dan Williams <dcbw@redhat.com>

Applied.

* [RFC][PATCH] ipv6 multicast forwarding: Remove threshold checking and some trivial bugs
From: Ang Way Chuang @ 2012-12-18  4:57 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji

This patch fixes trivial bugs in the IPv6 multicast forwarding code and removes
threshold checking for the multicast forwarding cache.

1. Threshold checking in the IPv6 multicast forwarding cache (MFC) was not properly implemented.
   A setsockopt(... MRT6_ADD_MIF, ...) call doesn't affect the TTL because the threshold is never used.
   In fact, every MFC entry always has a ttl of 1, as set by ip6mr_mfc_add. From my limited knowledge of
   multicast routing, the per-interface threshold setting is only used by DVMRP, which doesn't support
   IPv6. FreeBSD's struct mif6ctl doesn't have vifc_threshold. This patch removes the ttl cruft
   within the kernel. The userspace ABI is left unchanged for backward compatibility. Can someone
   knowledgeable in multicast routing please verify whether my understanding is correct?
2. Don't allow addition of an MFC entry with a non-existent multicast interface index.
3. Don't allow addition of an MFC entry where the incoming interface is part of the oif list. This
   does not make sense. Why would we want to send a multicast packet back to the interface it came from?
4. setsockopt(... MRT6_ADD_MIF, ...) allows a "physical" interface to be registered as a multicast
   interface multiple times. This doesn't make sense. Don't allow duplicate registration of the
   same "physical" interface.
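
Items 2 and 3 can be illustrated with a small userspace sketch. It is not the
kernel code: check_mfc() and struct oif_range are hypothetical names, the
interface sets are modelled as 32-bit masks instead of struct if_set, and
MAXMIFS is assumed to be 32 here.

```c
#include <stdint.h>

#define MAXMIFS 32 /* assumed limit; also the width of the bitmasks below */

/* Hypothetical result: half-open range [minvif, maxvif) covering the oif set. */
struct oif_range {
	int minvif;
	int maxvif;
};

/*
 * existing: bit i is set if multicast interface i is registered.
 * oifs:     bit i is set if interface i is requested as an outgoing interface.
 * parent:   index of the incoming interface.
 * Returns 0 and fills *range on success, -1 if the request is rejected.
 */
int check_mfc(uint32_t existing, uint32_t oifs, unsigned int parent,
	      struct oif_range *range)
{
	int minvif = MAXMIFS;
	int maxvif = 0;
	int i;

	/* The incoming interface must exist (item 2). */
	if (parent >= MAXMIFS || !(existing & (UINT32_C(1) << parent)))
		return -1;

	/* The incoming interface must not be an outgoing one (item 3). */
	if (oifs & (UINT32_C(1) << parent))
		return -1;

	for (i = 0; i < MAXMIFS; i++) {
		if (!(oifs & (UINT32_C(1) << i)))
			continue;
		/* Every outgoing interface must exist as well (item 2). */
		if (!(existing & (UINT32_C(1) << i)))
			return -1;
		if (minvif > i)
			minvif = i;
		if (maxvif <= i)
			maxvif = i + 1;
	}

	/* An empty outgoing set leaves the range inverted. */
	if (maxvif <= minvif)
		return -1;

	range->minvif = minvif;
	range->maxvif = maxvif;
	return 0;
}
```

The range is computed once while validating, so the forwarding loop only has
to walk the interfaces between minvif and maxvif.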

This patch has been tested, albeit minimally using a simple program. Is this patch okay for
inclusion? Will sign off if it is okay.


---
 include/linux/mroute6.h |    3 +-
 net/ipv6/ip6mr.c        |   79 +++++++++++++++++++++++++----------------------
 2 files changed, 43 insertions(+), 39 deletions(-)

diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index a223561..88a79d8 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -66,7 +66,6 @@ struct mif_device {
 	unsigned long	bytes_in,bytes_out;
 	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
 	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
-	unsigned char	threshold;		/* TTL threshold 		*/
 	unsigned short	flags;			/* Control flags 		*/
 	int		link;			/* Physical interface index	*/
 };
@@ -92,7 +91,7 @@ struct mfc6_cache {
 			unsigned long bytes;
 			unsigned long pkt;
 			unsigned long wrong_if;
-			unsigned char ttls[MAXMIFS];	/* TTL thresholds		*/
+			struct if_set mf6c_ifset;	/* Where it is going */
 		} res;
 	} mfc_un;
 };
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 26dcdec..0a12fe4 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -122,6 +122,7 @@ static int ip6mr_rtm_dumproute(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 static void mroute_clean_tables(struct mr6_table *mrt);
 static void ipmr_expire_process(unsigned long arg);
+static int ip6mr_find_vif(struct mr6_table *mrt, struct net_device *dev);
 
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
 #define ip6mr_for_each_table(mrt, net) \
@@ -574,10 +575,10 @@ static int ipmr_mfc_seq_show(struct seq_file *seq, void *v)
 			for (n = mfc->mfc_un.res.minvif;
 			     n < mfc->mfc_un.res.maxvif; n++) {
 				if (MIF_EXISTS(mrt, n) &&
-				    mfc->mfc_un.res.ttls[n] < 255)
+				    IF_ISSET(n, &mfc->mfc_un.res.mf6c_ifset))
 					seq_printf(seq,
 						   " %2d:%-3d",
-						   n, mfc->mfc_un.res.ttls[n]);
+						   n, 1);
 			}
 		} else {
 			/* unresolved mfc_caches don't contain
@@ -895,28 +896,6 @@ static void ipmr_expire_process(unsigned long arg)
 	spin_unlock(&mfc_unres_lock);
 }
 
-/* Fill oifs list. It is called under write locked mrt_lock. */
-
-static void ip6mr_update_thresholds(struct mr6_table *mrt, struct mfc6_cache *cache,
-				    unsigned char *ttls)
-{
-	int vifi;
-
-	cache->mfc_un.res.minvif = MAXMIFS;
-	cache->mfc_un.res.maxvif = 0;
-	memset(cache->mfc_un.res.ttls, 255, MAXMIFS);
-
-	for (vifi = 0; vifi < mrt->maxvif; vifi++) {
-		if (MIF_EXISTS(mrt, vifi) &&
-		    ttls[vifi] && ttls[vifi] < 255) {
-			cache->mfc_un.res.ttls[vifi] = ttls[vifi];
-			if (cache->mfc_un.res.minvif > vifi)
-				cache->mfc_un.res.minvif = vifi;
-			if (cache->mfc_un.res.maxvif <= vifi)
-				cache->mfc_un.res.maxvif = vifi + 1;
-		}
-	}
-}
 
 static int mif6_add(struct net *net, struct mr6_table *mrt,
 		    struct mif6ctl *vifc, int mrtsock)
@@ -955,6 +934,12 @@ static int mif6_add(struct net *net, struct mr6_table *mrt,
 		dev = dev_get_by_index(net, vifc->mif6c_pifi);
 		if (!dev)
 			return -EADDRNOTAVAIL;
+
+		if (ip6mr_find_vif(mrt, dev) >= 0) {
+			dev_put(dev);
+			return -EADDRINUSE;
+		}
+
 		err = dev_set_allmulti(dev, 1);
 		if (err) {
 			dev_put(dev);
@@ -980,7 +965,6 @@ static int mif6_add(struct net *net, struct mr6_table *mrt,
 	v->flags = vifc->mif6c_flags;
 	if (!mrtsock)
 		v->flags |= VIFF_STATIC;
-	v->threshold = vifc->vifc_threshold;
 	v->bytes_in = 0;
 	v->bytes_out = 0;
 	v->pkt_in = 0;
@@ -1393,22 +1377,37 @@ void ip6_mr_cleanup(void)
 static int ip6mr_mfc_add(struct net *net, struct mr6_table *mrt,
 			 struct mf6cctl *mfc, int mrtsock)
 {
+	int minvif = MAXMIFS;
+	int maxvif = 0;
 	bool found = false;
 	int line;
 	struct mfc6_cache *uc, *c;
-	unsigned char ttls[MAXMIFS];
 	int i;
 
-	if (mfc->mf6cc_parent >= MAXMIFS)
+	if (mfc->mf6cc_parent >= MAXMIFS || !MIF_EXISTS(mrt, mfc->mf6cc_parent))
 		return -ENFILE;
 
-	memset(ttls, 255, MAXMIFS);
-	for (i = 0; i < MAXMIFS; i++) {
-		if (IF_ISSET(i, &mfc->mf6cc_ifset))
-			ttls[i] = 1;
+	/* incoming interface should not be part of outgoing interfaces, doing so
+	 * will cause duplicate
+	 */
+	if (IF_ISSET(mfc->mf6cc_parent, &mfc->mf6cc_ifset))
+		return -EINVAL;
 
+	for (i = 0; i < MAXMIFS; i++) {
+		if (IF_ISSET(i, &mfc->mf6cc_ifset)) {
+			if (!MIF_EXISTS(mrt, i))
+				return -ENFILE;
+
+			if (minvif > i)
+				minvif = i;
+			if (maxvif <= i)
+				maxvif = i + 1;
+		}
 	}
 
+	if (maxvif <= minvif)	/* mf6cc_ifset is basically empty */
+		return -EINVAL;
+
 	line = MFC6_HASH(&mfc->mf6cc_mcastgrp.sin6_addr, &mfc->mf6cc_origin.sin6_addr);
 
 	list_for_each_entry(c, &mrt->mfc6_cache_array[line], list) {
@@ -1422,7 +1421,10 @@ static int ip6mr_mfc_add(struct net *net, struct mr6_table *mrt,
 	if (found) {
 		write_lock_bh(&mrt_lock);
 		c->mf6c_parent = mfc->mf6cc_parent;
-		ip6mr_update_thresholds(mrt, c, ttls);
+		c->mfc_un.res.mf6c_ifset = mfc->mf6cc_ifset;
+		c->mfc_un.res.minvif = minvif;
+		c->mfc_un.res.maxvif = maxvif;
+
 		if (!mrtsock)
 			c->mfc_flags |= MFC_STATIC;
 		write_unlock_bh(&mrt_lock);
@@ -1440,7 +1442,10 @@ static int ip6mr_mfc_add(struct net *net, struct mr6_table *mrt,
 	c->mf6c_origin = mfc->mf6cc_origin.sin6_addr;
 	c->mf6c_mcastgrp = mfc->mf6cc_mcastgrp.sin6_addr;
 	c->mf6c_parent = mfc->mf6cc_parent;
-	ip6mr_update_thresholds(mrt, c, ttls);
+	c->mfc_un.res.mf6c_ifset = mfc->mf6cc_ifset;
+	c->mfc_un.res.minvif = minvif;
+	c->mfc_un.res.maxvif = maxvif;
+
 	if (!mrtsock)
 		c->mfc_flags |= MFC_STATIC;
 
@@ -2036,7 +2041,7 @@ static int ip6_mr_forward(struct net *net, struct mr6_table *mrt,
 		       large chunk of pimd to kernel. Ough... --ANK
 		     */
 		    (mrt->mroute_do_pim ||
-		     cache->mfc_un.res.ttls[true_vifi] < 255) &&
+		     IF_ISSET(true_vifi, &cache->mfc_un.res.mf6c_ifset)) &&
 		    time_after(jiffies,
 			       cache->mfc_un.res.last_assert + MFC_ASSERT_THRESH)) {
 			cache->mfc_un.res.last_assert = jiffies;
@@ -2052,7 +2057,7 @@ static int ip6_mr_forward(struct net *net, struct mr6_table *mrt,
 	 *	Forward the frame
 	 */
 	for (ct = cache->mfc_un.res.maxvif - 1; ct >= cache->mfc_un.res.minvif; ct--) {
-		if (ipv6_hdr(skb)->hop_limit > cache->mfc_un.res.ttls[ct]) {
+		if (IF_ISSET(ct, &cache->mfc_un.res.mf6c_ifset)) {
 			if (psend != -1) {
 				struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 				if (skb2)
@@ -2143,7 +2148,7 @@ static int __ip6mr_fill_mroute(struct mr6_table *mrt, struct sk_buff *skb,
 		return -EMSGSIZE;
 
 	for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) {
-		if (MIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) {
+		if (MIF_EXISTS(mrt, ct) && IF_ISSET(ct, &c->mfc_un.res.mf6c_ifset)) {
 			nhp = nla_reserve_nohdr(skb, sizeof(*nhp));
 			if (nhp == NULL) {
 				nla_nest_cancel(skb, mp_attr);
@@ -2151,7 +2156,7 @@ static int __ip6mr_fill_mroute(struct mr6_table *mrt, struct sk_buff *skb,
 			}
 
 			nhp->rtnh_flags = 0;
-			nhp->rtnh_hops = c->mfc_un.res.ttls[ct];
+			nhp->rtnh_hops = 1;	/* this is  broken as IPv6 does not use TTL threshold */
 			nhp->rtnh_ifindex = mrt->vif6_table[ct].dev->ifindex;
 			nhp->rtnh_len = sizeof(*nhp);
 		}

* Re: [RFC][PATCH] ipv6 multicast forwarding: Remove threshold checking and some trivial bugs
From: David Miller @ 2012-12-18  5:03 UTC (permalink / raw)
  To: wcang; +Cc: netdev, yoshfuji
In-Reply-To: <50CFF7A7.1070508@sfc.wide.ad.jp>

From: Ang Way Chuang <wcang@sfc.wide.ad.jp>
Date: Tue, 18 Dec 2012 12:57:11 +0800

> This patch fixes trivial bugs in the IPv6 multicast forwarding code and removes
> threshold checking for the multicast forwarding cache.
> 
> 1. Threshold checking in the IPv6 multicast forwarding cache (MFC) was not properly implemented.
>    A setsockopt(... MRT6_ADD_MIF, ...) call doesn't affect the TTL because the threshold is never used.
>    In fact, every MFC entry always has a ttl of 1, as set by ip6mr_mfc_add. From my limited knowledge of
>    multicast routing, the per-interface threshold setting is only used by DVMRP, which doesn't support
>    IPv6. FreeBSD's struct mif6ctl doesn't have vifc_threshold. This patch removes the ttl cruft
>    within the kernel. The userspace ABI is left unchanged for backward compatibility. Can someone
>    knowledgeable in multicast routing please verify whether my understanding is correct?
> 2. Don't allow addition of an MFC entry with a non-existent multicast interface index.
> 3. Don't allow addition of an MFC entry where the incoming interface is part of the oif list. This
>    does not make sense. Why would we want to send a multicast packet back to the interface it came from?
> 4. setsockopt(... MRT6_ADD_MIF, ...) allows a "physical" interface to be registered as a multicast
>    interface multiple times. This doesn't make sense. Don't allow duplicate registration of the
>    same "physical" interface.
> 
> This patch has been tested, albeit minimally using a simple program. Is this patch okay for
> inclusion? Will sign off if it is okay.

How about we don't mix together a set of bug fixes, with a semantic
change (the removal of the threshold checking)?

I also don't see what the point is of not signing off on this change
when you submit it.

If you delay the signoff until after review, you're just causing it to
take longer to have your changes integrated.  It also makes it look
like you didn't believe fully in your change, so you probably should
have sent it as an RFC and listed your doubts in the email instead.

Overall I would rate this as an extremely poor patch submission,
sorry.

* Re: [RFC][PATCH] ipv6 multicast forwarding: Remove threshold checking and some trivial bugs
From: Ang Way Chuang @ 2012-12-18  5:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, yoshfuji
In-Reply-To: <20121217.210333.1952508082741483861.davem@davemloft.net>

Oops, sorry. You're right. I am not very confident in this modification; it may break some multicast
routing daemons. Let's drop this for now.

On 18/12/2012 13:03, David Miller wrote:
> From: Ang Way Chuang <wcang@sfc.wide.ad.jp>
> Date: Tue, 18 Dec 2012 12:57:11 +0800
> 
>> This patch fixes trivial bugs in the IPv6 multicast forwarding code and removes
>> threshold checking for the multicast forwarding cache.
>>
>> 1. Threshold checking in the IPv6 multicast forwarding cache (MFC) was not properly implemented.
>>    A setsockopt(... MRT6_ADD_MIF, ...) call doesn't affect the TTL because the threshold is never used.
>>    In fact, every MFC entry always has a ttl of 1, as set by ip6mr_mfc_add. From my limited knowledge of
>>    multicast routing, the per-interface threshold setting is only used by DVMRP, which doesn't support
>>    IPv6. FreeBSD's struct mif6ctl doesn't have vifc_threshold. This patch removes the ttl cruft
>>    within the kernel. The userspace ABI is left unchanged for backward compatibility. Can someone
>>    knowledgeable in multicast routing please verify whether my understanding is correct?
>> 2. Don't allow addition of an MFC entry with a non-existent multicast interface index.
>> 3. Don't allow addition of an MFC entry where the incoming interface is part of the oif list. This
>>    does not make sense. Why would we want to send a multicast packet back to the interface it came from?
>> 4. setsockopt(... MRT6_ADD_MIF, ...) allows a "physical" interface to be registered as a multicast
>>    interface multiple times. This doesn't make sense. Don't allow duplicate registration of the
>>    same "physical" interface.
>>
>> This patch has been tested, albeit minimally using a simple program. Is this patch okay for
>> inclusion? Will sign off if it is okay.
> 
> How about we don't mix together a set of bug fixes, with a semantic
> change (the removal of the threshold checking)?
> 
> I also don't see what the point is of not signing off on this change
> when you submit it.
> 
> If you delay the signoff until after review, you're just causing it to
> take longer to have your changes integrated.  It also makes it look
> like you didn't believe fully in your change, so you probably should
> have sent it as an RFC and listed your doubts in the email instead.
> 
> Overall I would rate this as an extremely poor patch submission,
> sorry.

* [PATCH 1/2] be2net: fix be_close() to ensure all events are ack'ed
From: Sathya Perla @ 2012-12-18  5:38 UTC (permalink / raw)
  To: netdev; +Cc: Sathya Perla

In be_close(), be_eq_clean() must be called after all RX/TX/MCC queues
have been cleaned to ensure that any events caused while cleaning up
completions are notified/acked. Not clearing all events can cause
unpredictable behaviour when RX rings are re-created in the subsequent
be_open().

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_cmds.c |    5 +++++
 drivers/net/ethernet/emulex/benet/be_main.c |   21 ++++++++++++---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index f2875aa..8a250c3 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -298,7 +298,12 @@ void be_async_mcc_enable(struct be_adapter *adapter)
 
 void be_async_mcc_disable(struct be_adapter *adapter)
 {
+	spin_lock_bh(&adapter->mcc_cq_lock);
+
 	adapter->mcc_obj.rearm_cq = false;
+	be_cq_notify(adapter, adapter->mcc_obj.cq.id, false, 0);
+
+	spin_unlock_bh(&adapter->mcc_cq_lock);
 }
 
 int be_process_mcc(struct be_adapter *adapter)
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index f95612b..bf50e73 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2398,13 +2398,22 @@ static int be_close(struct net_device *netdev)
 
 	be_roce_dev_close(adapter);
 
-	be_async_mcc_disable(adapter);
-
 	if (!lancer_chip(adapter))
 		be_intr_set(adapter, false);
 
-	for_all_evt_queues(adapter, eqo, i) {
+	for_all_evt_queues(adapter, eqo, i)
 		napi_disable(&eqo->napi);
+
+	be_async_mcc_disable(adapter);
+
+	/* Wait for all pending tx completions to arrive so that
+	 * all tx skbs are freed.
+	 */
+	be_tx_compl_clean(adapter);
+
+	be_rx_qs_destroy(adapter);
+
+	for_all_evt_queues(adapter, eqo, i) {
 		if (msix_enabled(adapter))
 			synchronize_irq(be_msix_vec_get(adapter, eqo));
 		else
@@ -2414,12 +2423,6 @@ static int be_close(struct net_device *netdev)
 
 	be_irq_unregister(adapter);
 
-	/* Wait for all pending tx completions to arrive so that
-	 * all tx skbs are freed.
-	 */
-	be_tx_compl_clean(adapter);
-
-	be_rx_qs_destroy(adapter);
 	return 0;
 }
 
-- 
1.7.1

* [PATCH 2/2] be2net: fix wrong frag_idx reported by RX CQ
From: Sathya Perla @ 2012-12-18  5:38 UTC (permalink / raw)
  To: netdev; +Cc: Sathya Perla
In-Reply-To: <1355809131-8924-1-git-send-email-sathya.perla@emulex.com>

The RX CQ can report completions with an invalid frag_idx when the RXQ that
was *previously* using it was not cleaned up properly. This hits
a BUG_ON() in be2net.

When completion coalescing is enabled on a CQ, an explicit CQ-notify
(with rearm) is needed for each compl, to flush partially coalesced CQ
entries that are pending DMA.

In be_close(), this fix now notifies CQ for each compl, waits explicitly
for the flush compl to arrive and complains if it doesn't arrive.
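
The drain loop can be sketched in userspace C as follows. This is a
simplified illustration, not the driver code: struct rx_compl, struct fake_cq,
cq_poll() and drain_until_flush() are hypothetical names, the CQ is simulated
by an array in which a NULL slot means "nothing ready on this poll", and the
delay between polls, the HW error check and the Lancer special case are left
out.

```c
#include <stddef.h>

#define MAX_EMPTY_POLLS 10 /* same bound as the patch uses for flush_wait */

/* Hypothetical completion entry; num_rcvd == 0 marks the flush completion. */
struct rx_compl {
	int num_rcvd;
};

/* Hypothetical simulated completion queue. */
struct fake_cq {
	const struct rx_compl *const *slots; /* NULL slot: nothing ready on this poll */
	size_t len;
	size_t pos;
	int notifies; /* counts CQ notifications (with rearm) */
};

/* Return the next completion, or NULL if none is available on this poll. */
const struct rx_compl *cq_poll(struct fake_cq *cq)
{
	if (cq->pos >= cq->len)
		return NULL;
	return cq->slots[cq->pos++];
}

/*
 * Consume completions until the flush completion arrives.
 * Returns 1 if it was seen, 0 if we gave up after too many empty polls.
 */
int drain_until_flush(struct fake_cq *cq)
{
	int empty_polls = 0;

	for (;;) {
		const struct rx_compl *c = cq_poll(cq);

		if (c == NULL) {
			if (empty_polls++ > MAX_EMPTY_POLLS)
				return 0; /* the driver would warn here */
			/* Notify even with nothing consumed, so that HW
			 * flushes partially coalesced entries. */
			cq->notifies++;
			continue;
		}

		/* Discard the entry and notify the CQ for each completion. */
		cq->notifies++;
		if (c->num_rcvd == 0)
			return 1;
	}
}
```

The important property is that the loop keeps notifying the CQ on empty
polls instead of stopping at the first one, and that it is bounded so a
missing flush completion cannot hang the close path.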

Also rename be_crit_error() to be_hw_error(), as it is the more
appropriate name and conveys that we skip waiting for the flush compl
only when a HW error has occurred.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be.h      |    2 +-
 drivers/net/ethernet/emulex/benet/be_main.c |   38 ++++++++++++++++++++++----
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h
index abf26c7..3bc1912 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -616,7 +616,7 @@ static inline bool be_error(struct be_adapter *adapter)
 	return adapter->eeh_error || adapter->hw_error || adapter->fw_timeout;
 }
 
-static inline bool be_crit_error(struct be_adapter *adapter)
+static inline bool be_hw_error(struct be_adapter *adapter)
 {
 	return adapter->eeh_error || adapter->hw_error;
 }
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index bf50e73..9dca22b 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1689,15 +1689,41 @@ static void be_rx_cq_clean(struct be_rx_obj *rxo)
 	struct be_queue_info *rxq = &rxo->q;
 	struct be_queue_info *rx_cq = &rxo->cq;
 	struct be_rx_compl_info *rxcp;
+	struct be_adapter *adapter = rxo->adapter;
+	int flush_wait = 0;
 	u16 tail;
 
-	/* First cleanup pending rx completions */
-	while ((rxcp = be_rx_compl_get(rxo)) != NULL) {
-		be_rx_compl_discard(rxo, rxcp);
-		be_cq_notify(rxo->adapter, rx_cq->id, false, 1);
+	/* Consume pending rx completions.
+	 * Wait for the flush completion (identified by zero num_rcvd)
+	 * to arrive. Notify CQ even when there are no more CQ entries
+	 * for HW to flush partially coalesced CQ entries.
+	 * In Lancer, there is no need to wait for flush compl.
+	 */
+	for (;;) {
+		rxcp = be_rx_compl_get(rxo);
+		if (rxcp == NULL) {
+			if (lancer_chip(adapter))
+				break;
+
+			if (flush_wait++ > 10 || be_hw_error(adapter)) {
+				dev_warn(&adapter->pdev->dev,
+					 "did not receive flush compl\n");
+				break;
+			}
+			be_cq_notify(adapter, rx_cq->id, true, 0);
+			mdelay(1);
+		} else {
+			be_rx_compl_discard(rxo, rxcp);
+			be_cq_notify(adapter, rx_cq->id, true, 1);
+			if (rxcp->num_rcvd == 0)
+				break;
+		}
 	}
 
-	/* Then free posted rx buffer that were not used */
+	/* After cleanup, leave the CQ in unarmed state */
+	be_cq_notify(adapter, rx_cq->id, false, 0);
+
+	/* Then free posted rx buffers that were not used */
 	tail = (rxq->head + rxq->len - atomic_read(&rxq->used)) % rxq->len;
 	for (; atomic_read(&rxq->used) > 0; index_inc(&tail, rxq->len)) {
 		page_info = get_rx_page_info(rxo, tail);
@@ -2157,7 +2183,7 @@ void be_detect_error(struct be_adapter *adapter)
 	u32 sliport_status = 0, sliport_err1 = 0, sliport_err2 = 0;
 	u32 i;
 
-	if (be_crit_error(adapter))
+	if (be_hw_error(adapter))
 		return;
 
 	if (lancer_chip(adapter)) {
-- 
1.7.1

* Re: [patch net-next 0/4] net: allow to change carrier from userspace
From: Stephen Hemminger @ 2012-12-18  6:49 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, edumazet, bhutchings, mirqus, greearb, fbl
In-Reply-To: <20121216105451.GA1546@minipsycho.orion>

On Sun, 16 Dec 2012 11:54:51 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> 
> I see that the patchset is in state "Rejected" in patchwork.
> Stephen convinced me for a moment that the problem can be handled by operstate.
> As it turned out (in last 3-4 emails in thread) operstate use would not
> be an option.
> 
> So how should I proceed? Should I repost the patchset? Anyone has any other
> comments?
> 
> thanks.

Don't take my comments so far as negative. Devices do need to be more controllable
from userspace. But I have concerns about introducing a new way to change state, which could
cause more races.  For example, changing carrier state should cause netlink events to fire, and
these should be posted to routing daemons etc. Also, what happens if some confused developer
mixes operstate and direct carrier control?

The root cause of all this confusion is that there are three ways of expressing
the same state, and they are controlled through different paths:
  a. Old BSD style flag bit IFF_RUNNING
  b. LINK_STATE bit in kernel (netif_carrier_ok)
  c. RFC2863 operational state

The operstate stuff is the most complete, but is the weakest in implementation:
  a. kernel drivers check netif_carrier_ok when they should be using netif_dormant
     (bridge is one example). But what will break if this changes?
  b. lower device state is not tracked correctly by tunnels and a few other layered devices
  c. the dormant state set from kernel space was never used much.

The good news is that the old BSD style IFF_RUNNING bit is the most commonly
used bit by applications and it works correctly in either carrier or operstate mode.

* Re: [PATCH 4/4] FEC: Add time stamping code and a PTP hardware clock
From: Richard Cochran @ 2012-12-18  7:04 UTC (permalink / raw)
  To: Sascha Hauer
  Cc: Frank Li, netdev, lznua, Frank Li, Shawn Guo, davem,
	linux-arm-kernel
In-Reply-To: <20121217200232.GS26326@pengutronix.de>

On Mon, Dec 17, 2012 at 09:02:32PM +0100, Sascha Hauer wrote:
> This leaves an option in the tree which can be used to break FEC on
> i.MX3/5.
> 
> 	depends on !SOC_IMX31 && !SOC_IMX35 && !SOC_IMX5
> 
> might be an option, but given that this patch seems to have bypassed any
> review I feel more like reverting it.

Instead of reverting, I suggest finding a solution (Frank) to let the
code work when it can work and to prevent it when it cannot. This
could be kconfig, DT, or run time probing of silicon revisions, but I
don't have access to this hardware, and so I can't really say how to
fix it.

Just for the record, I did in fact review this patch, and I commented
on exactly this point. Frank said we would address this point, and he
did so. Not knowing the imx family very well, I took his word for it.
After all, Frank has a Freescale address, and I would expect a
Freescale employee to know how to provide the right fix.

Thanks,
Richard

* [PATCH net-next] xfrm: removes a superfluous check and add a statistic
From: roy.qing.li @ 2012-12-18  8:39 UTC (permalink / raw)
  To: netdev

From: Li RongQing <roy.qing.li@gmail.com>

Remove the check that x->km.state equals XFRM_STATE_VALID from
xfrm_state_check_expire(); the check is done before
xfrm_state_check_expire() is called.

Add a LINUX_MIB_XFRMOUTSTATEINVALID statistic to record
outbound errors due to an invalid xfrm state.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 include/uapi/linux/snmp.h |    1 +
 net/xfrm/xfrm_output.c    |    6 ++++++
 net/xfrm/xfrm_proc.c      |    1 +
 net/xfrm/xfrm_state.c     |    3 ---
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index fdfba23..93b24ce 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -274,6 +274,7 @@ enum
 	LINUX_MIB_XFRMOUTSTATEMODEERROR,	/* XfrmOutStateModeError */
 	LINUX_MIB_XFRMOUTSTATESEQERROR,		/* XfrmOutStateSeqError */
 	LINUX_MIB_XFRMOUTSTATEEXPIRED,		/* XfrmOutStateExpired */
+	LINUX_MIB_XFRMOUTSTATEINVALID,		/* XfrmOutStateInvalid */
 	LINUX_MIB_XFRMOUTPOLBLOCK,		/* XfrmOutPolBlock */
 	LINUX_MIB_XFRMOUTPOLDEAD,		/* XfrmOutPolDead */
 	LINUX_MIB_XFRMOUTPOLERROR,		/* XfrmOutPolError */
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 95a338c..3670526 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -61,6 +61,12 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 		}
 
 		spin_lock_bh(&x->lock);
+
+		if (unlikely(x->km.state != XFRM_STATE_VALID)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTSTATEINVALID);
+			goto error;
+		}
+
 		err = xfrm_state_check_expire(x);
 		if (err) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTSTATEEXPIRED);
diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index d0a1af8..e4cd441 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -39,6 +39,7 @@ static const struct snmp_mib xfrm_mib_list[] = {
 	SNMP_MIB_ITEM("XfrmOutStateModeError", LINUX_MIB_XFRMOUTSTATEMODEERROR),
 	SNMP_MIB_ITEM("XfrmOutStateSeqError", LINUX_MIB_XFRMOUTSTATESEQERROR),
 	SNMP_MIB_ITEM("XfrmOutStateExpired", LINUX_MIB_XFRMOUTSTATEEXPIRED),
+	SNMP_MIB_ITEM("XfrmOutStateInvalid", LINUX_MIB_XFRMOUTSTATEINVALID),
 	SNMP_MIB_ITEM("XfrmOutPolBlock", LINUX_MIB_XFRMOUTPOLBLOCK),
 	SNMP_MIB_ITEM("XfrmOutPolDead", LINUX_MIB_XFRMOUTPOLDEAD),
 	SNMP_MIB_ITEM("XfrmOutPolError", LINUX_MIB_XFRMOUTPOLERROR),
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 3459692..05db236 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1370,9 +1370,6 @@ int xfrm_state_check_expire(struct xfrm_state *x)
 	if (!x->curlft.use_time)
 		x->curlft.use_time = get_seconds();
 
-	if (x->km.state != XFRM_STATE_VALID)
-		return -EINVAL;
-
 	if (x->curlft.bytes >= x->lft.hard_byte_limit ||
 	    x->curlft.packets >= x->lft.hard_packet_limit) {
 		x->km.state = XFRM_STATE_EXPIRED;
-- 
1.7.10.4
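
For readers following the control flow, here is a minimal userspace sketch
of the ordering the patch describes: the validity check runs in the output
path before the expiry check, and each failure bumps its own counter. The
types and names (model_state, model_output_one, the stats array) are
illustrative stand-ins, not the kernel's xfrm structures, and locking is
left out entirely.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's state values and MIB counters. */
enum model_km_state { MODEL_STATE_VALID, MODEL_STATE_EXPIRED, MODEL_STATE_DEAD };
enum model_mib { MIB_OUT_STATE_INVALID, MIB_OUT_STATE_EXPIRED, MIB_MAX };

struct model_state {
	enum model_km_state km_state;
	uint64_t bytes;            /* bytes sent so far */
	uint64_t hard_byte_limit;  /* hard lifetime limit */
};

static unsigned long stats[MIB_MAX];

/* Expiry check only; it no longer looks at the validity of the state. */
static int model_check_expire(struct model_state *x)
{
	if (x->bytes >= x->hard_byte_limit) {
		x->km_state = MODEL_STATE_EXPIRED;
		return -EINVAL;
	}
	return 0;
}

/* Output path: reject invalid states first, then check for expiry. */
static int model_output_one(struct model_state *x)
{
	if (x->km_state != MODEL_STATE_VALID) {
		stats[MIB_OUT_STATE_INVALID]++;
		return -EINVAL;
	}
	if (model_check_expire(x)) {
		stats[MIB_OUT_STATE_EXPIRED]++;
		return -EINVAL;
	}
	return 0;
}
```

Calling model_output_one() on a state that is not valid only touches the
"invalid" counter, so the two failure causes stay distinguishable in the
statistics.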

^ permalink raw reply related

* Re: [PATCH 4/4] FEC: Add time stamping code and a PTP hardware clock
From: Frank Li @ 2012-12-18  8:51 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Sascha Hauer, Shawn Guo, Frank Li, lznua, linux-arm-kernel,
	netdev, davem
In-Reply-To: <20121218070420.GA2946@netboy.at.omicron.at>

>>
>> might be an option, but given that this patch seems to have bypassed any
>> review I feel more like reverting it.
>
> Instead of reverting, I suggest finding a solution (Frank) to let the
> code work when it can work and to prevent it when it cannot. This
> could be kconfig, DT, or run time probing of silicon revisions, but I
> don't have access to this hardware, and so I can't really say how to
> fix it.

I am traveling this week. I will try to find a solution once I am back
in the office next week. In the meantime, the quick solution is Shawn's
patch, which resolves the mx3 and mx5 problem, at least in the short term.

>
> Just for the record, I did in fact review this patch, and I commented
> on exactly this point. Frank said we would address this point, and he
> did so. Not knowing the imx family very well, I took his word for it.
> After all, Frank has a Freescale address, and I would expect a
> Freescale employee to know how to provide the right fix.
>
> Thanks,
> Richard

^ permalink raw reply

* RE: [PATCH v2] netlink: align attributes on 64-bits
From: David Laight @ 2012-12-18  9:19 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: bhutchings, tgraf, netdev, davem
In-Reply-To: <50CF57DC.5050804@6wind.com>

> On 17/12/2012 18:06, David Laight wrote:
> >>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
> >>   {
> >> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
> >> +	int align = IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) ? 0 : 4;
> >
> > I've just realised where you are adding this!
> > You only want to add pad if the attribute is a single 64bit item,
> > not whenever the destination is misaligned.
> As said in the commit log, I want to align all attributes. An attribute can be
> like this:
> 
> struct foo {
> 	__u32 bar1;
> 	__u32 bar2;
> 	__u64 bar3;
> }
> 
> nla_put() doesn't know what is contained in the attribute.

But there is no need to 8-byte align something whose size isn't a
multiple of 8 bytes.

> > ...
> >> +	if (align) {
> >> +		/* Goal is to add an attribute with size 4. We know that
> >> +		 * NLA_HDRLEN is 4, hence payload is 0.
> >> +		 */
> >> +		__nla_reserve(skb, 0, 0);
> >
> > One of those zeros should be 'align - 4', then the comment
> > can be more descriptive.

> I thought if you were to research why we use 0, you would know that the first 0
> is the type and the second is the payload size...

I can tell that one is the type and the other the size; you've
implied that the 'type+size' actually total 4 bytes.
I don't need to find out which is which!
Now that you've told me, I'd have written:
	__nla_reserve(skb, 0, align - NLA_HDRLEN);

The compiler could well have tracked the value, and so know it is 4.
OTOH you might want to generate the size of 'align' without
using a conditional.
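
As a rough sketch of that last point: assuming the tail pointer is already
4-byte aligned (netlink attributes are padded to a multiple of 4), the pad
size can be read straight from bit 2 of the address. The helper names
below are made up for illustration, and this is plain userspace C rather
than the kernel code.

```c
#include <stdint.h>

#define MODEL_NLA_HDRLEN 4	/* attribute header size, as in the quoted comment */

/* Conditional form, mirroring the expression in the quoted patch. */
static unsigned int pad_conditional(uintptr_t tail)
{
	return (tail % 8 == 0) ? 0 : 4;
}

/* Branch-free form; only valid when tail is a multiple of 4. */
static unsigned int pad_branch_free(uintptr_t tail)
{
	return (unsigned int)(tail & 4);
}

/*
 * Payload length of the padding attribute. Only meaningful when
 * align is non-zero: the header alone then fills the 4 bytes.
 */
static unsigned int pad_payload_len(unsigned int align)
{
	return align - MODEL_NLA_HDRLEN;
}
```

For every 4-byte-aligned address the two forms give the same answer, and
pad_payload_len(4) is 0, which matches the "payload is 0" comment in the
patch.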

	David

^ permalink raw reply

* Re: [patch net-next 0/4] net: allow to change carrier from userspace
From: Jiri Pirko @ 2012-12-18  9:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, davem, edumazet, bhutchings, mirqus, greearb, fbl
In-Reply-To: <20121217224957.70775f99@nehalam.linuxnetplumber.net>

Tue, Dec 18, 2012 at 07:49:57AM CET, shemminger@vyatta.com wrote:
>On Sun, 16 Dec 2012 11:54:51 +0100
>Jiri Pirko <jiri@resnulli.us> wrote:
>
>> 
>> I see that the patchset is in state "Rejected" in patchwork.
>> Stephen convinced me for a moment that the problem can be handled by operstate.
>> As it turned out (in last 3-4 emails in thread) operstate use would not
>> be an option.
>> 
>> So how should I proceed? Should I repost the patchset? Does anyone have any
>> other comments?
>> 
>> thanks.
>
>Don't take my comments so far as negative. Devices do need to be more controllable
>from userspace. But I have concerns about introducing a new way to change state causing
>more races.  For example, changing carrier state should cause netlink events to fire and
>these should post to routing daemons etc. Also, what happens if some confused developer
>mixes operstate and direct carrier control?

I do not think that the race you are describing is of any concern. The
same can happen now for any device. My patchset only adds a possibility
for "soft devices" to change the carrier as well.

A developer is unlikely to be confused. The possibility of changing the
carrier from userspace will be limited to a small set of devices; for other
devices the attempt will lead to -EOPNOTSUPP (in contrast with operstate,
which is available for all devices).
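
A minimal sketch of the behaviour described above, using made-up names
(model_dev, model_change_carrier) rather than anything from the actual
patchset: the carrier callback is optional, and a device that does not
provide it gets -EOPNOTSUPP.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model: the carrier callback is optional. */
struct model_dev {
	bool carrier_ok;
	int (*change_carrier)(struct model_dev *dev, bool new_carrier);
};

/* Callback a "soft device" might provide. */
static int soft_change_carrier(struct model_dev *dev, bool new_carrier)
{
	dev->carrier_ok = new_carrier;
	return 0;
}

/* Entry point for a userspace request to change the carrier. */
static int model_change_carrier(struct model_dev *dev, bool new_carrier)
{
	if (!dev->change_carrier)
		return -EOPNOTSUPP;	/* device did not opt in */
	return dev->change_carrier(dev, new_carrier);
}
```

A device whose change_carrier pointer is NULL keeps its carrier state
untouched, so the request fails cleanly instead of silently doing nothing.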

I can add comments/notes to the code and to operstates.txt stating the
purpose of this iface.

>
>The root cause of all this confusion is that there are three ways of expressing
>the same state, and they are controlled through different paths:
>  a. Old BSD style flag bit IFF_RUNNING
>  b. LINK_STATE bit in kernel (netif_carrier_ok)
>  c. RFC2863 operational state

I do not think so. Yes, a) and c) are strictly connected and express
the same thing. But b) is not the same. It sits at a lower level
than a) and c). What b) can be compared to is IFF_LOWER_UP.
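
To make the layering concrete, here is a simplified userspace model of how
the flags reported to userspace can be derived from the two pieces of
state. It is a sketch of the logic as I understand it from the kernel's
dev_get_flags(), not a copy of it: the flag and operstate values match
<linux/if.h>, but the MODEL_ names, the struct and the function are
invented for illustration, and the "interface is running" precondition is
left out.

```c
#include <stdbool.h>

#define MODEL_IFF_RUNNING  0x40u     /* a) and c): operationally up */
#define MODEL_IFF_LOWER_UP 0x10000u  /* b): carrier present */

/* Subset of the RFC2863 operational states. */
enum model_operstate {
	OPER_UNKNOWN = 0,
	OPER_DOWN = 2,
	OPER_DORMANT = 5,
	OPER_UP = 6
};

struct model_link {
	bool carrier_ok;                 /* lower-level link state */
	enum model_operstate operstate;  /* operational state */
};

/* Derive the two flags independently from the two pieces of state. */
static unsigned int model_flags(const struct model_link *l)
{
	unsigned int flags = 0;

	if (l->operstate == OPER_UP || l->operstate == OPER_UNKNOWN)
		flags |= MODEL_IFF_RUNNING;
	if (l->carrier_ok)
		flags |= MODEL_IFF_LOWER_UP;
	return flags;
}
```

A link with carrier but in the dormant state reports the lower-up flag
without the running flag, which is the case where b) and a)/c) disagree.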

>
>The operstate stuff is the most complete, but is the weakest in implementation:
>  a. kernel drivers check netif_carrier_ok when they should be using netif_dormant
>     (bridge is one example). But what will break if this changes?
I agree, that should be changed.

>  b. lower device state is not tracked correctly by tunnels and a few other layered devices
>  c. dormant from kernel space was never used by much.
>
>The good news is that the old BSD style IFF_RUNNING bit is the most commonly
>used bit by applications and it works correctly in either carrier or operstate mode.

That is indeed a good thing.

^ permalink raw reply

