Netdev List


* Re: PATCH: Network Device Naming mechanism and policy
From: Bill Nottingham @ 2009-10-12 17:37 UTC
  To: Scott James Remnant
  Cc: Matt Domsch, Narendra K, netdev, linux-hotplug, jordan_hargrave
In-Reply-To: <1255344075.2143.1.camel@warcraft>

Scott James Remnant (scott@ubuntu.com) said: 
> On the other hand, they *tend* to be unique for a wide range of systems.
> This makes them pretty comparable to LABELs on disks, and we have
> a /dev/disk/by-label
> 
> Remember that udev already supports symlink stacking, and priorities and
> such.
> 
> I don't think there's any danger of supporting a /dev/netdev/by-mac by
> default, it'll be a benefit to most and those who don't have unique MACs
> will just ignore it.

At the moment, we do not appear to get the proper change uevents from things
like 'ip link set dev <foo> address <bar>', so we can't currently maintain
these symlinks.

Bill


* Re: PATCH: Network Device Naming mechanism and policy
From: Bill Nottingham @ 2009-10-12 17:45 UTC
  To: Greg KH
  Cc: Matt Domsch, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <20091010052308.GA12458@kroah.com>

Greg KH (greg@kroah.com) said: 
> > Today, port naming is completely nondeterministic.  If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
> 
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
> 
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.

No, it's not solved. Even if you have persistent names once you install,
if you ever re-image, you're likely to get *different* persistent names;
the first load will always be non-deterministic.

The only way around this would be to have some sort of screen like:

  Would you like your network devices to be enumerated by

  [ ] MAC address
  [ ] PCI device order
  [ ] Driver name
  [ ] Other

which is just all sorts of fail in and of itself. Especially since
once you get to the point where you can coherently ask this in a
native installer, the drivers have already loaded.

Bill
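
To make the re-image point concrete, here is a rough sketch of how the
same set of NICs can come out in a different order depending on which
enumeration policy is picked. Everything in it is an assumption for
illustration: struct nic, by_mac(), by_pci() and enumerate() are
hypothetical names, not anything udev or an installer provides.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record describing one NIC; fields are illustrative only. */
struct nic {
	const char *mac;	/* lowercase, colon-separated MAC address */
	const char *pci;	/* PCI address in domain:bus:slot.func form */
	const char *driver;	/* driver name */
};

/* Policy 1: order by MAC address (fixed-width lowercase hex compares fine). */
static int by_mac(const void *a, const void *b)
{
	return strcmp(((const struct nic *)a)->mac,
		      ((const struct nic *)b)->mac);
}

/* Policy 2: order by PCI address. */
static int by_pci(const void *a, const void *b)
{
	return strcmp(((const struct nic *)a)->pci,
		      ((const struct nic *)b)->pci);
}

/* Sort nics[] in place; under the chosen policy, index i would become eth<i>. */
static void enumerate(struct nic *nics, size_t n,
		      int (*policy)(const void *, const void *))
{
	qsort(nics, n, sizeof(*nics), policy);
}
```

With two cards whose MAC order and PCI order disagree, enumerate(nics, 2,
by_mac) and enumerate(nics, 2, by_pci) hand eth0 to different hardware,
which is the choice the installer would have to make before it can ask.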


* Re: [PATCH 1/1] net: Introduce recvmmsg socket syscall
From: Nir Tzachar @ 2009-10-12 17:53 UTC
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, netdev, Arnaldo Carvalho de Melo, Caitlin Bestler,
	Chris Van Hoof, Clark Williams, Neil Horman, Nivedita Singhvi,
	Paul Moore, Rémi Denis-Courmont, Steven Whitehouse
In-Reply-To: <1255364440-23271-1-git-send-email-acme@redhat.com>

Hi Arnaldo.

Do you have any plans on how we can further investigate the delays I
have seen with the second part of the patch? I have tried to simply
unlock/lock the socket's mutex every couple of iterations inside the
loop (to allow the system to process some backlog), but this seems to
have little to no effect.

Also, a way to enable/disable the no_lock version at runtime would
greatly help in testing. Maybe by first introducing a second syscall,
recvmmsg_no_lock, for testing purposes?

Cheers,
Nir.
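
For anyone wanting to try the interface from userspace, a minimal sketch
of a batched receive follows. It assumes a libc that exposes a recvmmsg()
wrapper with the same five arguments as the SYSCALL_DEFINE5 in the quoted
patch (fd, mmsghdr vector, vlen, flags, timespec timeout). recv_batch(),
BATCH and BUFSZ are hypothetical names picked for the example, and the
one-second timeout is arbitrary.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

#define BATCH	8	/* vlen: max datagrams per call (arbitrary) */
#define BUFSZ	64	/* per-datagram buffer size (arbitrary) */

/*
 * Hypothetical helper: receive up to BATCH datagrams from fd with one
 * recvmmsg() call.  Each datagram's length is copied into lens[].
 * Returns the number of datagrams received, or -1 on error.
 */
static int recv_batch(int fd, char bufs[BATCH][BUFSZ],
		      unsigned int lens[BATCH])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iovs[BATCH];
	struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		/* one buffer per datagram, one iovec per msghdr */
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* MSG_DONTWAIT: take whatever is queued instead of waiting for BATCH */
	n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, &timeout);

	/* msg_len is filled in per entry for each datagram received */
	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;

	return n;
}
```

Called on a datagram socket with three datagrams queued, this should
return 3 rather than BATCH, which is the "less entries than requested"
case the patch comments describe for non-blocking receives.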

On Mon, Oct 12, 2009 at 6:20 PM, Arnaldo Carvalho de Melo
<acme@ghostprotocols.net> wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
>
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
>
> This takes into account comments made by:
>
> . Paul Moore: sock_recvmsg is called only for the first datagram,
>  sock_recvmsg_nosec is used for the rest.
>
> . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
>  works in the same fashion as the ppoll one.
>
>  If the underlying protocol returns a datagram with MSG_OOB set, this
>  will make recvmmsg return right away with as many datagrams (+ the OOB
>  one) it has received so far.
>
> . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
>  datagrams and then recvmsg returns an error, recvmmsg will return
>  the successfully received datagrams, store the error and return it
>  in the next call.
>
> This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
> where we will be able to acquire the lock only at batch start and end, not at
> every underlying recvmsg call.
>
> Cc: Caitlin Bestler <caitlin.bestler@gmail.com>
> Cc: Chris Van Hoof <vanhoof@redhat.com>
> Cc: Clark Williams <williams@redhat.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Nir Tzachar <nir.tzachar@gmail.com>
> Cc: Nivedita Singhvi <niv@us.ibm.com>
> Cc: Paul Moore <paul.moore@hp.com>
> Cc: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
> Cc: Steven Whitehouse <steve@chygwyn.com>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  arch/alpha/kernel/systbls.S            |    1 +
>  arch/arm/kernel/calls.S                |    1 +
>  arch/avr32/kernel/syscall_table.S      |    1 +
>  arch/blackfin/mach-common/entry.S      |    1 +
>  arch/ia64/kernel/entry.S               |    1 +
>  arch/microblaze/kernel/syscall_table.S |    1 +
>  arch/mips/kernel/scall32-o32.S         |    1 +
>  arch/mips/kernel/scall64-64.S          |    1 +
>  arch/mips/kernel/scall64-n32.S         |    1 +
>  arch/mips/kernel/scall64-o32.S         |    1 +
>  arch/sh/kernel/syscalls_64.S           |    1 +
>  arch/sparc/kernel/systbls_64.S         |    4 +-
>  arch/x86/ia32/ia32entry.S              |    1 +
>  arch/x86/include/asm/unistd_32.h       |    3 +-
>  arch/x86/include/asm/unistd_64.h       |    2 +
>  arch/x86/kernel/syscall_table_32.S     |    1 +
>  arch/xtensa/include/asm/unistd.h       |    4 +-
>  include/linux/net.h                    |    1 +
>  include/linux/socket.h                 |   10 ++
>  include/linux/syscalls.h               |    4 +
>  include/net/compat.h                   |    8 +
>  kernel/sys_ni.c                        |    2 +
>  net/compat.c                           |   33 +++++-
>  net/socket.c                           |  225 ++++++++++++++++++++++++++------
>  24 files changed, 260 insertions(+), 49 deletions(-)
>
> diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
> index 95c9aef..cda6b8b 100644
> --- a/arch/alpha/kernel/systbls.S
> +++ b/arch/alpha/kernel/systbls.S
> @@ -497,6 +497,7 @@ sys_call_table:
>        .quad sys_signalfd
>        .quad sys_ni_syscall
>        .quad sys_eventfd
> +       .quad sys_recvmmsg
>
>        .size sys_call_table, . - sys_call_table
>        .type sys_call_table, @object
> diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
> index fafce1b..f58c115 100644
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -374,6 +374,7 @@
>                CALL(sys_pwritev)
>                CALL(sys_rt_tgsigqueueinfo)
>                CALL(sys_perf_event_open)
> +/* 365 */      CALL(sys_recvmmsg)
>  #ifndef syscalls_counted
>  .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
>  #define syscalls_counted
> diff --git a/arch/avr32/kernel/syscall_table.S b/arch/avr32/kernel/syscall_table.S
> index 7ee0057..e76bad1 100644
> --- a/arch/avr32/kernel/syscall_table.S
> +++ b/arch/avr32/kernel/syscall_table.S
> @@ -295,4 +295,5 @@ sys_call_table:
>        .long   sys_signalfd
>        .long   sys_ni_syscall          /* 280, was sys_timerfd */
>        .long   sys_eventfd
> +       .long   sys_recvmmsg
>        .long   sys_ni_syscall          /* r8 is saturated at nr_syscalls */
> diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
> index 1e7cac2..4869272 100644
> --- a/arch/blackfin/mach-common/entry.S
> +++ b/arch/blackfin/mach-common/entry.S
> @@ -1621,6 +1621,7 @@ ENTRY(_sys_call_table)
>        .long _sys_pwritev
>        .long _sys_rt_tgsigqueueinfo
>        .long _sys_perf_event_open
> +       .long _sys_recvmmsg             /* 370 */
>
>        .rept NR_syscalls-(.-_sys_call_table)/4
>        .long _sys_ni_syscall
> diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
> index d0e7d37..d75b872 100644
> --- a/arch/ia64/kernel/entry.S
> +++ b/arch/ia64/kernel/entry.S
> @@ -1806,6 +1806,7 @@ sys_call_table:
>        data8 sys_preadv
>        data8 sys_pwritev                       // 1320
>        data8 sys_rt_tgsigqueueinfo
> +       data8 sys_recvmmsg
>
>        .org sys_call_table + 8*NR_syscalls     // guard against failures to increase NR_syscalls
>  #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
> diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
> index ecec191..c1ab1dc 100644
> --- a/arch/microblaze/kernel/syscall_table.S
> +++ b/arch/microblaze/kernel/syscall_table.S
> @@ -371,3 +371,4 @@ ENTRY(sys_call_table)
>        .long sys_ni_syscall
>        .long sys_rt_tgsigqueueinfo     /* 365 */
>        .long sys_perf_event_open
> +       .long sys_recvmmsg
> diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
> index fd2a9bb..17202bb 100644
> --- a/arch/mips/kernel/scall32-o32.S
> +++ b/arch/mips/kernel/scall32-o32.S
> @@ -583,6 +583,7 @@ einval:     li      v0, -ENOSYS
>        sys     sys_rt_tgsigqueueinfo   4
>        sys     sys_perf_event_open     5
>        sys     sys_accept4             4
> +       sys     sys_recvmmsg            5
>        .endm
>
>        /* We pre-compute the number of _instruction_ bytes needed to
> diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
> index 18bf7f3..a8a6c59 100644
> --- a/arch/mips/kernel/scall64-64.S
> +++ b/arch/mips/kernel/scall64-64.S
> @@ -420,4 +420,5 @@ sys_call_table:
>        PTR     sys_rt_tgsigqueueinfo
>        PTR     sys_perf_event_open
>        PTR     sys_accept4
> +       PTR     sys_recvmmsg
>        .size   sys_call_table,.-sys_call_table
> diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
> index 6ebc079..5154e64 100644
> --- a/arch/mips/kernel/scall64-n32.S
> +++ b/arch/mips/kernel/scall64-n32.S
> @@ -418,4 +418,5 @@ EXPORT(sysn32_call_table)
>        PTR     compat_sys_rt_tgsigqueueinfo    /* 5295 */
>        PTR     sys_perf_event_open
>        PTR     sys_accept4
> +       PTR     compat_sys_recvmmsg
>        .size   sysn32_call_table,.-sysn32_call_table
> diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
> index 9bbf977..d0eff53 100644
> --- a/arch/mips/kernel/scall64-o32.S
> +++ b/arch/mips/kernel/scall64-o32.S
> @@ -538,4 +538,5 @@ sys_call_table:
>        PTR     compat_sys_rt_tgsigqueueinfo
>        PTR     sys_perf_event_open
>        PTR     sys_accept4
> +       PTR     compat_sys_recvmmsg
>        .size   sys_call_table,.-sys_call_table
> diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
> index 5bfde6c..07d2aae 100644
> --- a/arch/sh/kernel/syscalls_64.S
> +++ b/arch/sh/kernel/syscalls_64.S
> @@ -391,3 +391,4 @@ sys_call_table:
>        .long sys_pwritev
>        .long sys_rt_tgsigqueueinfo
>        .long sys_perf_event_open
> +       .long sys_recvmmsg              /* 365 */
> diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
> index 009825f..f37bef7 100644
> --- a/arch/sparc/kernel/systbls_64.S
> +++ b/arch/sparc/kernel/systbls_64.S
> @@ -83,7 +83,7 @@ sys_call_table32:
>  /*310*/        .word compat_sys_utimensat, compat_sys_signalfd, sys_timerfd_create, sys_eventfd, compat_sys_fallocate
>        .word compat_sys_timerfd_settime, compat_sys_timerfd_gettime, compat_sys_signalfd4, sys_eventfd2, sys_epoll_create1
>  /*320*/        .word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, compat_sys_preadv
> -       .word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, sys_perf_event_open
> +       .word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, sys_perf_event_open, compat_sys_recvmmsg
>
>  #endif /* CONFIG_COMPAT */
>
> @@ -158,4 +158,4 @@ sys_call_table:
>  /*310*/        .word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
>        .word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
>  /*320*/        .word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv
> -       .word sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_event_open
> +       .word sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_event_open, sys_recvmmsg
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 74619c4..11a6c79 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -832,4 +832,5 @@ ia32_sys_call_table:
>        .quad compat_sys_pwritev
>        .quad compat_sys_rt_tgsigqueueinfo      /* 335 */
>        .quad sys_perf_event_open
> +       .quad compat_sys_recvmmsg
>  ia32_syscall_end:
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6fb3c20..3baf379 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -342,10 +342,11 @@
>  #define __NR_pwritev           334
>  #define __NR_rt_tgsigqueueinfo 335
>  #define __NR_perf_event_open   336
> +#define __NR_recvmmsg          337
>
>  #ifdef __KERNEL__
>
> -#define NR_syscalls 337
> +#define NR_syscalls 338
>
>  #define __ARCH_WANT_IPC_PARSE_VERSION
>  #define __ARCH_WANT_OLD_READDIR
> diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
> index 8d3ad0a..4843f7b 100644
> --- a/arch/x86/include/asm/unistd_64.h
> +++ b/arch/x86/include/asm/unistd_64.h
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_event_open                   298
>  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_recvmmsg                          299
> +__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
>
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index 0157cd2..70c2125 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
>        .long sys_pwritev
>        .long sys_rt_tgsigqueueinfo     /* 335 */
>        .long sys_perf_event_open
> +       .long sys_recvmmsg
> diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h
> index c092c8f..4e55dc7 100644
> --- a/arch/xtensa/include/asm/unistd.h
> +++ b/arch/xtensa/include/asm/unistd.h
> @@ -681,8 +681,10 @@ __SYSCALL(304, sys_signalfd, 3)
>  __SYSCALL(305, sys_ni_syscall, 0)
>  #define __NR_eventfd                           306
>  __SYSCALL(306, sys_eventfd, 1)
> +#define __NR_recvmmsg                          307
> +__SYSCALL(307, sys_recvmmsg, 5)
>
> -#define __NR_syscall_count                     307
> +#define __NR_syscall_count                     308
>
>  /*
>  * sysxtensa syscall handler
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 529a093..b42bb60 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -41,6 +41,7 @@
>  #define SYS_SENDMSG    16              /* sys_sendmsg(2)               */
>  #define SYS_RECVMSG    17              /* sys_recvmsg(2)               */
>  #define SYS_ACCEPT4    18              /* sys_accept4(2)               */
> +#define SYS_RECVMMSG   19              /* sys_recvmmsg(2)              */
>
>  typedef enum {
>        SS_FREE = 0,                    /* not allocated                */
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 3273a0c..59966f1 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -65,6 +65,12 @@ struct msghdr {
>        unsigned        msg_flags;
>  };
>
> +/* For recvmmsg/sendmmsg */
> +struct mmsghdr {
> +       struct msghdr   msg_hdr;
> +       unsigned        msg_len;
> +};
> +
>  /*
>  *     POSIX 1003.1g - ancillary data object information
>  *     Ancillary data consits of a sequence of pairs of
> @@ -312,6 +318,10 @@ extern int move_addr_to_user(struct sockaddr *kaddr, int klen, void __user *uadd
>  extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr *kaddr);
>  extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data);
>
> +struct timespec;
> +
> +extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
> +                         unsigned int flags, struct timespec *timeout);
>  #endif
>  #endif /* not kernel and not glibc */
>  #endif /* _LINUX_SOCKET_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a990ace..714f063 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -25,6 +25,7 @@ struct linux_dirent64;
>  struct list_head;
>  struct msgbuf;
>  struct msghdr;
> +struct mmsghdr;
>  struct msqid_ds;
>  struct new_utsname;
>  struct nfsctl_arg;
> @@ -677,6 +678,9 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
>  asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
>                                struct sockaddr __user *, int __user *);
>  asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
> +asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
> +                            unsigned int vlen, unsigned flags,
> +                            struct timespec __user *timeout);
>  asmlinkage long sys_socket(int, int, int);
>  asmlinkage long sys_socketpair(int, int, int, int __user *);
>  asmlinkage long sys_socketcall(int call, unsigned long __user *args);
> diff --git a/include/net/compat.h b/include/net/compat.h
> index 7c30028..9679f05 100644
> --- a/include/net/compat.h
> +++ b/include/net/compat.h
> @@ -18,6 +18,11 @@ struct compat_msghdr {
>        compat_uint_t   msg_flags;
>  };
>
> +struct compat_mmsghdr {
> +       struct compat_msghdr msg_hdr;
> +       compat_uint_t        msg_len;
> +};
> +
>  struct compat_cmsghdr {
>        compat_size_t   cmsg_len;
>        compat_int_t    cmsg_level;
> @@ -35,6 +40,9 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
>  extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
>  extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
>  extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
> +extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
> +                                          unsigned, unsigned,
> +                                          struct timespec __user *);
>  extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
>  extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
>
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index e06d0b8..f050ba8 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -48,8 +48,10 @@ cond_syscall(sys_shutdown);
>  cond_syscall(sys_sendmsg);
>  cond_syscall(compat_sys_sendmsg);
>  cond_syscall(sys_recvmsg);
> +cond_syscall(sys_recvmmsg);
>  cond_syscall(compat_sys_recvmsg);
>  cond_syscall(compat_sys_recvfrom);
> +cond_syscall(compat_sys_recvmmsg);
>  cond_syscall(sys_socketcall);
>  cond_syscall(sys_futex);
>  cond_syscall(compat_sys_futex);
> diff --git a/net/compat.c b/net/compat.c
> index a407c3a..e13f525 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -727,10 +727,10 @@ EXPORT_SYMBOL(compat_mc_getsockopt);
>
>  /* Argument list sizes for compat_sys_socketcall */
>  #define AL(x) ((x) * sizeof(u32))
> -static unsigned char nas[19]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
> +static unsigned char nas[20]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
>                                AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
>                                AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
> -                               AL(4)};
> +                               AL(4),AL(5)};
>  #undef AL
>
>  asmlinkage long compat_sys_sendmsg(int fd, struct compat_msghdr __user *msg, unsigned flags)
> @@ -755,13 +755,36 @@ asmlinkage long compat_sys_recvfrom(int fd, void __user *buf, size_t len,
>        return sys_recvfrom(fd, buf, len, flags | MSG_CMSG_COMPAT, addr, addrlen);
>  }
>
> +asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
> +                                   unsigned vlen, unsigned int flags,
> +                                   struct timespec __user *timeout)
> +{
> +       int datagrams;
> +       struct timespec ktspec;
> +       struct compat_timespec __user *utspec =
> +                       (struct compat_timespec __user *)timeout;
> +
> +       if (get_user(ktspec.tv_sec, &utspec->tv_sec) ||
> +           get_user(ktspec.tv_nsec, &utspec->tv_nsec))
> +               return -EFAULT;
> +
> +       datagrams = __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
> +                                  flags | MSG_CMSG_COMPAT, &ktspec);
> +       if (datagrams > 0 &&
> +           (put_user(ktspec.tv_sec, &utspec->tv_sec) ||
> +            put_user(ktspec.tv_nsec, &utspec->tv_nsec)))
> +               datagrams = -EFAULT;
> +
> +       return datagrams;
> +}
> +
>  asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
>  {
>        int ret;
>        u32 a[6];
>        u32 a0, a1;
>
> -       if (call < SYS_SOCKET || call > SYS_ACCEPT4)
> +       if (call < SYS_SOCKET || call > SYS_RECVMMSG)
>                return -EINVAL;
>        if (copy_from_user(a, args, nas[call]))
>                return -EFAULT;
> @@ -823,6 +846,10 @@ asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
>        case SYS_RECVMSG:
>                ret = compat_sys_recvmsg(a0, compat_ptr(a1), a[2]);
>                break;
> +       case SYS_RECVMMSG:
> +               ret = compat_sys_recvmmsg(a0, compat_ptr(a1), a[2], a[3],
> +                                         compat_ptr(a[4]));
> +               break;
>        case SYS_ACCEPT4:
>                ret = sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), a[3]);
>                break;
> diff --git a/net/socket.c b/net/socket.c
> index 954f338..3dd03df 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -668,10 +668,9 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
>
>  EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
>
> -static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
> -                                struct msghdr *msg, size_t size, int flags)
> +static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
> +                                      struct msghdr *msg, size_t size, int flags)
>  {
> -       int err;
>        struct sock_iocb *si = kiocb_to_siocb(iocb);
>
>        si->sock = sock;
> @@ -680,13 +679,17 @@ static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
>        si->size = size;
>        si->flags = flags;
>
> -       err = security_socket_recvmsg(sock, msg, size, flags);
> -       if (err)
> -               return err;
> -
>        return sock->ops->recvmsg(iocb, sock, msg, size, flags);
>  }
>
> +static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
> +                                struct msghdr *msg, size_t size, int flags)
> +{
> +       int err = security_socket_recvmsg(sock, msg, size, flags);
> +
> +       return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
> +}
> +
>  int sock_recvmsg(struct socket *sock, struct msghdr *msg,
>                 size_t size, int flags)
>  {
> @@ -702,6 +705,21 @@ int sock_recvmsg(struct socket *sock, struct msghdr *msg,
>        return ret;
>  }
>
> +static int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
> +                             size_t size, int flags)
> +{
> +       struct kiocb iocb;
> +       struct sock_iocb siocb;
> +       int ret;
> +
> +       init_sync_kiocb(&iocb, NULL);
> +       iocb.private = &siocb;
> +       ret = __sock_recvmsg_nosec(&iocb, sock, msg, size, flags);
> +       if (-EIOCBQUEUED == ret)
> +               ret = wait_on_sync_kiocb(&iocb);
> +       return ret;
> +}
> +
>  int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
>                   struct kvec *vec, size_t num, size_t size, int flags)
>  {
> @@ -1968,22 +1986,15 @@ out:
>        return err;
>  }
>
> -/*
> - *     BSD recvmsg interface
> - */
> -
> -SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> -               unsigned int, flags)
> +static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
> +                        struct msghdr *msg_sys, unsigned flags, int nosec)
>  {
>        struct compat_msghdr __user *msg_compat =
>            (struct compat_msghdr __user *)msg;
> -       struct socket *sock;
>        struct iovec iovstack[UIO_FASTIOV];
>        struct iovec *iov = iovstack;
> -       struct msghdr msg_sys;
>        unsigned long cmsg_ptr;
>        int err, iov_size, total_len, len;
> -       int fput_needed;
>
>        /* kernel mode address */
>        struct sockaddr_storage addr;
> @@ -1993,27 +2004,23 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>        int __user *uaddr_len;
>
>        if (MSG_CMSG_COMPAT & flags) {
> -               if (get_compat_msghdr(&msg_sys, msg_compat))
> +               if (get_compat_msghdr(msg_sys, msg_compat))
>                        return -EFAULT;
>        }
> -       else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
> +       else if (copy_from_user(msg_sys, msg, sizeof(struct msghdr)))
>                return -EFAULT;
>
> -       sock = sockfd_lookup_light(fd, &err, &fput_needed);
> -       if (!sock)
> -               goto out;
> -
>        err = -EMSGSIZE;
> -       if (msg_sys.msg_iovlen > UIO_MAXIOV)
> -               goto out_put;
> +       if (msg_sys->msg_iovlen > UIO_MAXIOV)
> +               goto out;
>
>        /* Check whether to allocate the iovec area */
>        err = -ENOMEM;
> -       iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
> -       if (msg_sys.msg_iovlen > UIO_FASTIOV) {
> +       iov_size = msg_sys->msg_iovlen * sizeof(struct iovec);
> +       if (msg_sys->msg_iovlen > UIO_FASTIOV) {
>                iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
>                if (!iov)
> -                       goto out_put;
> +                       goto out;
>        }
>
>        /*
> @@ -2021,46 +2028,47 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>         *      kernel msghdr to use the kernel address space)
>         */
>
> -       uaddr = (__force void __user *)msg_sys.msg_name;
> +       uaddr = (__force void __user *)msg_sys->msg_name;
>        uaddr_len = COMPAT_NAMELEN(msg);
>        if (MSG_CMSG_COMPAT & flags) {
> -               err = verify_compat_iovec(&msg_sys, iov,
> +               err = verify_compat_iovec(msg_sys, iov,
>                                          (struct sockaddr *)&addr,
>                                          VERIFY_WRITE);
>        } else
> -               err = verify_iovec(&msg_sys, iov,
> +               err = verify_iovec(msg_sys, iov,
>                                   (struct sockaddr *)&addr,
>                                   VERIFY_WRITE);
>        if (err < 0)
>                goto out_freeiov;
>        total_len = err;
>
> -       cmsg_ptr = (unsigned long)msg_sys.msg_control;
> -       msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
> +       cmsg_ptr = (unsigned long)msg_sys->msg_control;
> +       msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
>
>        if (sock->file->f_flags & O_NONBLOCK)
>                flags |= MSG_DONTWAIT;
> -       err = sock_recvmsg(sock, &msg_sys, total_len, flags);
> +       err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys,
> +                                                         total_len, flags);
>        if (err < 0)
>                goto out_freeiov;
>        len = err;
>
>        if (uaddr != NULL) {
>                err = move_addr_to_user((struct sockaddr *)&addr,
> -                                       msg_sys.msg_namelen, uaddr,
> +                                       msg_sys->msg_namelen, uaddr,
>                                        uaddr_len);
>                if (err < 0)
>                        goto out_freeiov;
>        }
> -       err = __put_user((msg_sys.msg_flags & ~MSG_CMSG_COMPAT),
> +       err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT),
>                         COMPAT_FLAGS(msg));
>        if (err)
>                goto out_freeiov;
>        if (MSG_CMSG_COMPAT & flags)
> -               err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
> +               err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
>                                 &msg_compat->msg_controllen);
>        else
> -               err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
> +               err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
>                                 &msg->msg_controllen);
>        if (err)
>                goto out_freeiov;
> @@ -2069,21 +2077,150 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  out_freeiov:
>        if (iov != iovstack)
>                sock_kfree_s(sock->sk, iov, iov_size);
> -out_put:
> +out:
> +       return err;
> +}
> +
> +/*
> + *     BSD recvmsg interface
> + */
> +
> +SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> +               unsigned int, flags)
> +{
> +       int fput_needed, err;
> +       struct msghdr msg_sys;
> +       struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +
> +       if (!sock)
> +               goto out;
> +
> +       err = __sys_recvmsg(sock, msg, &msg_sys, flags, 0);
> +
>        fput_light(sock->file, fput_needed);
>  out:
>        return err;
>  }
>
> -#ifdef __ARCH_WANT_SYS_SOCKETCALL
> +/*
> + *     Linux recvmmsg interface
> + */
> +
> +int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
> +                  unsigned int flags, struct timespec *timeout)
> +{
> +       int fput_needed, err, datagrams;
> +       struct socket *sock;
> +       struct mmsghdr __user *entry;
> +       struct msghdr msg_sys;
> +       struct timespec end_time;
> +
> +       if (timeout &&
> +           poll_select_set_timeout(&end_time, timeout->tv_sec,
> +                                   timeout->tv_nsec))
> +               return -EINVAL;
> +
> +       datagrams = 0;
> +
> +       sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +       if (!sock)
> +               return err;
> +
> +       err = sock_error(sock->sk);
> +       if (err)
> +               goto out_put;
> +
> +       entry = mmsg;
> +
> +       while (datagrams < vlen) {
> +               /*
> +                * No need to ask LSM for more than the first datagram.
> +                */
> +               err = __sys_recvmsg(sock, (struct msghdr __user *)entry,
> +                                   &msg_sys, flags, datagrams);
> +               if (err < 0)
> +                       break;
> +               err = put_user(err, &entry->msg_len);
> +               if (err)
> +                       break;
> +               ++entry;
> +               ++datagrams;
> +
> +               if (timeout) {
> +                       ktime_get_ts(timeout);
> +                       *timeout = timespec_sub(end_time, *timeout);
> +                       if (timeout->tv_sec < 0) {
> +                               timeout->tv_sec = timeout->tv_nsec = 0;
> +                               break;
> +                       }
> +
> +                       /* Timeout, return less than vlen datagrams */
> +                       if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
> +                               break;
> +               }
> +
> +               /* Out of band data, return right away */
> +               if (msg_sys.msg_flags & MSG_OOB)
> +                       break;
> +       }
> +
> +out_put:
> +       fput_light(sock->file, fput_needed);
>
> +       if (err == 0)
> +               return datagrams;
> +
> +       if (datagrams != 0) {
> +               /*
> +                * We may return less entries than requested (vlen) if the
> +                * sock is non block and there aren't enough datagrams...
> +                */
> +               if (err != -EAGAIN) {
> +                       /*
> +                        * ... or  if recvmsg returns an error after we
> +                        * received some datagrams, where we record the
> +                        * error to return on the next call or if the
> +                        * app asks about it using getsockopt(SO_ERROR).
> +                        */
> +                       sock->sk->sk_err = -err;
> +               }
> +
> +               return datagrams;
> +       }
> +
> +       return err;
> +}
> +
> +SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
> +               unsigned int, vlen, unsigned int, flags,
> +               struct timespec __user *, timeout)
> +{
> +       int datagrams;
> +       struct timespec timeout_sys;
> +
> +       if (!timeout)
> +               return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL);
> +
> +       if (copy_from_user(&timeout_sys, timeout, sizeof(timeout_sys)))
> +               return -EFAULT;
> +
> +       datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
> +
> +       if (datagrams > 0 &&
> +           copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
> +               datagrams = -EFAULT;
> +
> +       return datagrams;
> +}
> +
> +#ifdef __ARCH_WANT_SYS_SOCKETCALL
>  /* Argument list sizes for sys_socketcall */
>  #define AL(x) ((x) * sizeof(unsigned long))
> -static const unsigned char nargs[19]={
> +static const unsigned char nargs[20] = {
>        AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
>        AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
>        AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
> -       AL(4)
> +       AL(4),AL(5)
>  };
>
>  #undef AL
> @@ -2103,7 +2240,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
>        int err;
>        unsigned int len;
>
> -       if (call < 1 || call > SYS_ACCEPT4)
> +       if (call < 1 || call > SYS_RECVMMSG)
>                return -EINVAL;
>
>        len = nargs[call];
> @@ -2181,6 +2318,10 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
>        case SYS_RECVMSG:
>                err = sys_recvmsg(a0, (struct msghdr __user *)a1, a[2]);
>                break;
> +       case SYS_RECVMMSG:
> +               err = sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3],
> +                                  (struct timespec __user *)a[4]);
> +               break;
>        case SYS_ACCEPT4:
>                err = sys_accept4(a0, (struct sockaddr __user *)a1,
>                                  (int __user *)a[2], a[3]);
> --
> 1.5.5.1
>
>
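
The timeout bookkeeping in the quoted loop (after each datagram, compute how much time is left until the deadline, clamp it at zero, and stop once nothing remains) can be sketched in userspace roughly as follows. This is only an illustration under assumptions: `ts_sub` and `remaining_timeout` are hypothetical names standing in for the kernel's timespec_sub() and the inline logic in the patch, and the caller is assumed to supply the current time itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Return a - b, normalised so that 0 <= tv_nsec < SKETCH_NSEC_PER_SEC. */
static struct timespec ts_sub(struct timespec a, struct timespec b)
{
	struct timespec r;

	r.tv_sec = a.tv_sec - b.tv_sec;
	r.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (r.tv_nsec < 0) {
		r.tv_sec -= 1;
		r.tv_nsec += SKETCH_NSEC_PER_SEC;
	}
	return r;
}

/*
 * Store the time left until end_time in *left, clamped to zero.
 * Returns true when the deadline has been reached, i.e. when the
 * caller should stop receiving and return what it has so far.
 */
static bool remaining_timeout(struct timespec end_time, struct timespec now,
			      struct timespec *left)
{
	assert(left != NULL);

	*left = ts_sub(end_time, now);
	if (left->tv_sec < 0) {
		left->tv_sec = 0;
		left->tv_nsec = 0;
	}
	return left->tv_sec == 0 && left->tv_nsec == 0;
}
```

A receive loop would call remaining_timeout() after each datagram and break out as soon as it returns true, which matches the "return less than vlen datagrams" branch in the patch.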

^ permalink raw reply

* Link bouncing with multiple drivers.
From: Ben Greear @ 2009-10-12 17:55 UTC (permalink / raw)
  To: NetDev

I have a strange issue:

In several different scenarios, when we run pktgen at line speed
on 1G through our proprietary bridge-ish module, we see link go
up and down every minute or so.  This happens with e1000, e1000e,
and have also seen it on ixgbe (though not line speed here...but
we were driving it as hard as the systems could handle).

We cannot reproduce this when a normal bridge is substituted for
our proprietary module, so it must either be a bug in our code somewhere,
or something to do with the fact that our module causes more work to
be done than a bridge (backed up driver queues on rx/tx, time-stamps, etc).

Since it happens across multiple drivers and hardware (and operating systems:  F5, F8, F11,
but all with kernels based on 2.6.31), it must be some general issue.

If anyone has any ideas where I should start poking, I would
be grateful.  For now, I'm off to dig in the e1000e code...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Greg KH @ 2009-10-12 17:55 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Matt Domsch, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <20091012174528.GB22736@nostromo.devel.redhat.com>

On Mon, Oct 12, 2009 at 01:45:28PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said: 
> > > Today, port naming is completely nondeterministic.  If you have but
> > > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > > If you have >1 NIC, chances increase to get it wrong.
> > 
> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address.  I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
> > 
> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
> 
> No, it's not solved. Even if you have persistent names once you install,
> if you ever re-image, you're likely to get *different* persistent names;
> the first load will always be non-deterministic.
> 
> The only way around this would be to have some sort of screen like:
> 
>   Would you like your network devices to be enumerated by
> 
>   [ ] MAC address
>   [ ] PCI device order
>   [ ] Driver name
>   [ ] Other

[ ] PCI slot name

That's one that modern systems are now reporting, and should solve
Matt's problem as well, right?

> which is just all sorts of fail in and of itself. Especially since
> once you get to the point where you can coherently ask this in a
> native installer, the drivers have already loaded.

No, the driver load order doesn't determine this, you need the drivers
loaded first before you can rename anything :)

And I don't see how Matt's proposed patch helps resolve this type of
issue any better than what we currently have today, do you?

thanks,

greg k-h

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Bill Nottingham @ 2009-10-12 18:07 UTC (permalink / raw)
  To: Greg KH
  Cc: Matt Domsch, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <20091012175508.GA10946@kroah.com>

Greg KH (greg@kroah.com) said: 
> > No, it's not solved. Even if you have persistent names once you install,
> > if you ever re-image, you're likely to get *different* persistent names;
> > the first load will always be non-deterministic.
> > 
> > The only way around this would be to have some sort of screen like:
> > 
> >   Would you like your network devices to be enumerated by
> > 
> >   [ ] MAC address
> >   [ ] PCI device order
> >   [ ] Driver name
> >   [ ] Other
> 
> [ ] PCI slot name
> 
> That's one that modern systems are now reporting, and should solve
> Matt's problem as well, right?

... maybe. On my laptop, the first 'slot' enumerated appears to be
the cardbus bridge, before the on-board ethernet. And on the desktop
next to me, the slot driver shows nothing.

> And I don't see how Matt's proposed patch helps resolve this type of
> issue any better than what we currently have today, do you?

It allows multiple addressing schemes to be active at once, which
can allow the admin to choose post-install without making an
active choice at installation. This is an improvement, even if
it doesn't solve the world.

Bill

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Narendra K @ 2009-10-12 18:07 UTC (permalink / raw)
  To: notting, scott
  Cc: matt_domsch, netdev, linux-hotplug, jordan_hargrave, charles_rose
In-Reply-To: <EDA0A4495861324DA2618B4C45DCB3EE58953F@blrx3m08.blr.amer.dell.com>

> > This makes them pretty comparable to LABELs on disks, and we have a 
> > /dev/disk/by-label
> > 
> > Remember that udev already supports symlink stacking, and priorities 
> > and such.
> > 
> > I don't think there's any danger of supporting a /dev/netdev/by-mac by
> 
> > default, it'll be a benefit to most and those who don't have unique 
> > MACs will just ignore it.
> 
> At the moment, we do not appear to get the proper change uevents from
> things like 'ip link set dev <foo> address <bar>', so we can't currently
> maintain these symlinks.
> 

I have observed that the kernel does generate a "move" event when
interfaces are renamed. It looks like udev at present doesn't handle this
event, but I suppose it could be extended to handle it.

With regards,
Narendra K

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Greg KH @ 2009-10-12 18:15 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Matt Domsch, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <20091012180740.GE22736@nostromo.devel.redhat.com>

On Mon, Oct 12, 2009 at 02:07:42PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said: 
> > > No, it's not solved. Even if you have persistent names once you install,
> > > if you ever re-image, you're likely to get *different* persistent names;
> > > the first load will always be non-deterministic.
> > > 
> > > The only way around this would be to have some sort of screen like:
> > > 
> > >   Would you like your network devices to be enumerated by
> > > 
> > >   [ ] MAC address
> > >   [ ] PCI device order
> > >   [ ] Driver name
> > >   [ ] Other
> > 
> > [ ] PCI slot name
> > 
> > That's one that modern systems are now reporting, and should solve
> > Matt's problem as well, right?
> 
> ... maybe. On my laptop, the first 'slot' enumerated appears to be
> the cardbus bridge, before the on-board ethernet. And on the desktop
> next to me, the slot driver shows nothing.

On servers, where this matters (multiple ethernet pci devices), this
should all be present if the manufacturer wants it to be, as it is just
an ACPI table entry.

> > And I don't see how Matt's proposed patch helps resolve this type of
> > issue any better than what we currently have today, do you?
> 
> It allows multiple addressing schemes to be active at once, which
> can allow the admin to choose post-install without making an
> active choice at installation. This is an improvement, even if
> it doesn't solve the world.

But these different names can not be used by the networking stack, or in
scripts, as others have pointed out.  Which seems to be the big problem
here.

thanks,

greg k-h

^ permalink raw reply

* Re: netinet/ip.h and DSCP
From: Philip A. Prindeville @ 2009-10-12 18:20 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4AC68714.5080909@redfish-solutions.com>

Since linux-net doesn't seem to be the right forum...


On 10/02/2009 04:04 PM, Philip A. Prindeville wrote:
> Is there a reason that /usr/include/netinet/ip.h only contains
> definitions for ToS (precedence-based) markings?
>
> RFCs 2597/2598 have been out a *long* time... why is Linux so far behind
> the learning curve?
>
> We're discussing (at least here in the US) net neutrality...  but if
> we're not marking our traffic the way we want it carried *anyway*, don't
> we deserve what we get if the carriers reshape our traffic by their own
> rules?
>
> Should there be definitions in netinet/ip.h like:
>
> #define IPTOS_DSCP_MASK		0xfc
> #define IPTOS_DSCP(x)		((x) & IPTOS_DSCP_MASK)
> #define IPTOS_DSCP_AF11		0x28
> #define IPTOS_DSCP_AF12		0x30
> #define IPTOS_DSCP_AF13		0x38
> #define IPTOS_DSCP_AF21		0x48
> #define IPTOS_DSCP_AF22		0x50
> #define IPTOS_DSCP_AF23		0x58
> #define IPTOS_DSCP_AF31		0x68
> #define IPTOS_DSCP_AF32		0x70
> #define IPTOS_DSCP_AF33		0x78
> #define IPTOS_DSCP_AF41		0x88
> #define IPTOS_DSCP_AF42		0x90
> #define IPTOS_DSCP_AF43		0x98
> #define IPTOS_DSCP_EF		0xb8
>
>
> Should be simple enough, right?
>
> Thanks,
>
> -Philip
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-net" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   
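
The values proposed above follow from placing the 6-bit DSCP codepoint in the upper six bits of the ToS byte, so each constant is the codepoint shifted left by two, and a mask covering exactly those six bits would be 0xfc. For Assured Forwarding, RFC 2597 assigns class x with drop precedence y the codepoint 8x + 2y. A small sketch of that derivation, with hypothetical helper names that are not part of netinet/ip.h:

```c
#include <assert.h>
#include <stdint.h>

/* The DSCP occupies the upper six bits of the IPv4 ToS byte. */
#define SKETCH_DSCP_SHIFT	2
#define SKETCH_DSCP_MASK	0xfc

/* ToS-byte value for a raw 6-bit DSCP codepoint (0..63). */
static inline uint8_t dscp_to_tos(uint8_t dscp)
{
	assert(dscp < 64);
	return (uint8_t)((dscp << SKETCH_DSCP_SHIFT) & SKETCH_DSCP_MASK);
}

/* Assured Forwarding class (1-4) and drop precedence (1-3), per RFC 2597. */
static inline uint8_t af_to_tos(unsigned int cls, unsigned int drop)
{
	assert(cls >= 1 && cls <= 4 && drop >= 1 && drop <= 3);
	return dscp_to_tos((uint8_t)(8 * cls + 2 * drop));
}
```

For instance, AF11 is codepoint 10, which gives 0x28, and Expedited Forwarding is codepoint 46, which gives 0xb8, in agreement with the list above.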


^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Rob Townley @ 2009-10-12 18:35 UTC (permalink / raw)
  To: Greg KH
  Cc: Matt Domsch, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <20091012030008.GA8436@kroah.com>

On Sun, Oct 11, 2009 at 10:00 PM, Greg KH <greg@kroah.com> wrote:
> On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
>> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
>> the add-in cards will come before the motherboard NICs.   i don't like it.
>
> Huh?  Have you used the MAC persistent rules?  If you add a new card,
> what does it pick for it?

i have a hp-dl360 (two nics) with a fibre optic add in nic.  On a
fresh install, the add-in is eth0.  i didn't like it, but ran it for
years.

>
>> But please whatever is done, make sure ping and tracert still works when
>> telling it to use a ethX source interface:
>>
>> eth0 = 4.3.2.8, the default gateway is thru eth1.
>> ping -I eth0 208.67.222.222              FAILS
>> ping -I 4.3.2.8 208.67.222.222          WORKS
>> tracert -i eth0 -I 208.67.222.222        FAILS
>> tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
>> tracert -i eth0 208.67.222.222           FAILS
>> tracert -s 4.3.2.8 208.67.222.222      WORKS
>
> Again, is what we currently have broken?  I am confused as to what this
> is referring to.

Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
On a multinic, multigatewayed machine, passing ethX instead of the IP
address will give the false result: "Destination Host Unreachable"
when the machine's default gateway is reached thru the other nic.   In
the following example, the default gateway is thru eth1, not eth0.
Pay attention to the text between the '*****'.

ping -c 1 -B -I  eth0 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 eth0*****:
56(84) bytes of data.
From 4.3.2.8 icmp_seq=1 Destination Host Unreachable

#ping -c 1 -B -I  4.3.2.8 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 *****: 56(84)
bytes of data.
64 bytes from 208.67.222.222: icmp_seq=1 ttl=55 time=562 ms



>
> greg k-h
>

^ permalink raw reply

* Re: bisect results of MSI-X related panic (help!)
From: Brandeburg, Jesse @ 2009-10-12 18:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jesse Brandeburg, Frans Pop, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, Ingo Molnar, hpa@zytor.com
In-Reply-To: <4AD2E05A.6060700@kernel.org>

On Mon, 12 Oct 2009, Tejun Heo wrote:
> > any other debugging tricks/ideas?
> 
> Hmm... stackprotector adds considerable amount of stack usage and it
> could be you're seeing stack overflow which would also explain the
> random crashes you've been seeing.  Do you have DEBUG_STACKOVERFLOW
> turned on?  This is on x86_64, right?

Hi, thanks for your response, 

[root@jbrandeb-hc linux-2.6.32-rc1]# grep STACKO .config
CONFIG_DEBUG_STACKOVERFLOW=y

[root@jbrandeb-hc linux-2.6.32-rc1]# grep X86_64 .config
CONFIG_X86_64=y
CONFIG_X86_64_SMP=y
CONFIG_X86_64_ACPI_NUMA=y

stack size is 8K

I tried Jarek's suggestion of CPUMASK_OFFSTACK and still panic.
[66027.266057] Kernel panic - not syncing: stack-protector: Kernel stack 
is corrupted in: ffffffff810b4eb0
[66027.266059]
[66027.266070] Kernel panic - not syncing: stack-protector: Kernel stack 
is corrupted in: ffffffff81472856
[66027.266071]
[66027.266081] Pid: 0, comm: swapper Tainted: G        W  
2.6.32-rc2-git-debug #6
[66027.266086] Call Trace:

that was all I got.  Interesting double fault, that hadn't happened 
before.

the symbols might be off slightly since I rebuilt the kernel, but this was 
an initial poke at the offsets above in gdb:
(gdb) l *0xffffffff810b4eb0
0xffffffff810b4eb0 is in dynamic_irq_cleanup (kernel/irq/chip.c:86).
81              desc->handle_irq = handle_bad_irq;
82              desc->chip = &no_irq_chip;
83              desc->name = NULL;
84              clear_kstat_irqs(desc);
85              spin_unlock_irqrestore(&desc->lock, flags);
86      }
87
88
89      /**
90       *      set_irq_chip - set the irq chip for an irq
(gdb) l *0xffffffff8147285
No source file for address 0xffffffff8147285.
(gdb) l *0xffffffff81472856
0xffffffff81472856 is in show_kprobe_addr (kernel/kprobes.c:1306).
1301            struct hlist_head *head;
1302            struct hlist_node *node;
1303            struct kprobe *p, *kp;
1304            const char *sym = NULL;
1305            unsigned int i = *(loff_t *) v;
1306            unsigned long offset = 0;
1307            char *modname, namebuf[128];
1308
1309            head = &kprobe_table[i];
1310            preempt_disable();



^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Matt Domsch @ 2009-10-12 18:44 UTC (permalink / raw)
  To: Rob Townley
  Cc: Greg KH, Stephen Hemminger, netdev, linux-hotplug, Narendra_K,
	jordan_hargrave
In-Reply-To: <7e84ed60910121135j656d1d9s8d84757e7e3d0078@mail.gmail.com>

On Mon, Oct 12, 2009 at 01:35:25PM -0500, Rob Townley wrote:
> > Again, is what we currently have broken?  I am confused as to what this
> > is referring to.
> 
> Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
> On a multinic, multigatewayed machine, passing ethX instead of the IP
> address will give the false result: "Destination Host Unreachable"
> when the machine's default gateway is reached thru the other nic.   In
> the following example, the default gateway is thru eth1, not eth0.

Unrelated to this thread.  We're having a hard enough time making sure
this conversation accurately reflects the views and needs of everyone
involved.  Please let's not throw in another tangent.

Thanks,
Matt

-- 
Matt Domsch
Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Narendra K @ 2009-10-12 18:47 UTC (permalink / raw)
  To: greg, notting
  Cc: matt_domsch, netdev, shemminger, linux-hotplug, jordan_hargrave,
	charles_rose
In-Reply-To: <EDA0A4495861324DA2618B4C45DCB3EE589541@blrx3m08.blr.amer.dell.com>

> On Mon, Oct 12, 2009 at 01:45:28PM -0400, Bill Nottingham wrote:
> > Greg KH (greg@kroah.com) said: 
> > > > Today, port naming is completely nondeterministic.  If you have 
> > > > but one NIC, there are few chances to get the name wrong (it'll be
> eth0).
> > > > If you have >1 NIC, chances increase to get it wrong.
> > > 
> > > That is why all distros name network devices based on the only 
> > > deterministic thing they have today, the MAC address.  I still fail 
> > > to see why you do not like this solution, it is honestly the only 
> > > way to properly name network devices in a sane manner.
> > > 
> > > All distros also provide a way to easily rename the network devices,
> 
> > > to place a specific name on a specific MAC address, so again, this 
> > > should all be solved already.
> > 
> > No, it's not solved. Even if you have persistent names once you 
> > install, if you ever re-image, you're likely to get *different* 
> > persistent names; the first load will always be non-deterministic.
> > 
> > The only way around this would be to have some sort of screen like:
> > 
> >   Would you like your network devices to be enumerated by
> > 
> >   [ ] MAC address
> >   [ ] PCI device order
> >   [ ] Driver name
> >   [ ] Other
> 
> [ ] PCI slot name
> 
> That's one that modern systems are now reporting, and should solve
> Matt's problem as well, right?

MAC addresses and PCI slots might ensure that device names are persistent
across system reboots. They do not ensure that LOM 1 is named
"eth0", which is the expectation. In unattended installs,
installers abort the installation if the port that gets the name "eth0"
does not have link up and does not have an IP address. This is often the case
because the LOMs have the boot capability. We can achieve
persistent naming using MAC addresses, but that doesn't address the
expectation that LOM-1 becomes "eth0" on every reboot, which is mostly
relied on for unattended installs. (Installers can be told to use options like
IPAPPEND 2, but then this solution would be of no use.)

> > which is just all sorts of fail in and of itself. Especially since 
> > once you get to the point where you can coherently ask this in a 
> > native installer, the drivers have already loaded.
> 
> No, the driver load order doesn't determine this, you need the drivers
> loaded first before you can rename anything :)
> 

Renaming an interface in the kernel namespace itself might lead to
problems like duplicate names. Having names in an alternate namespace,
rather than in the kernel namespace, might be more useful.

> And I don't see how Matt's proposed patch helps resolve this type of
> issue any better than what we currently have today, do you?
> 

I have a system which has 4 LOMs and 1 add-in NIC, and the add-in NIC
always gets the name "eth0" even though I PXE booted from LOM-1. Since
"eth0" doesn't have link up, the installer stops and asks which
interface should get IP. This would not suit an unattended install
scenario. If the installer can use a pathname like
/dev/net/by-chassis-label/Embedded_NIC_1 (->eth1 which is my LOM-1), it
would always point to the correct interface irrespective of whether it
is "eth0" or not.

With regards,
Narendra K

^ permalink raw reply

* Re: [PATCH] ax25: unsigned cannot be less than 0 in ax25_ctl_ioctl()
From: Roel Kluin @ 2009-10-12 19:02 UTC (permalink / raw)
  To: wharms; +Cc: linux-hams, netdev, Joerg Reuter, Andrew Morton
In-Reply-To: <4AD34D3C.1060206@bfs.de>

struct ax25_ctl_struct member `arg' is unsigned and cannot be less
than 0.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
---
>> If the ax25_ctl.arg limit is known to be lower, please suggest
>> other values.

> what is about something like:
> 
>  tmp_arg=ax25_ctl.arg * HZ;
> 
>   if (arg == 0 || arg >  ULONG_MAX )
> 		goto einval_put;
> 
> re,
>  wh

I'm not sure, I think this would only work if we made `arg' an
unsigned long long.

How about this?

diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index f454607..20ff0f3 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -369,6 +369,9 @@ static int ax25_ctl_ioctl(const unsigned int cmd, void __user *arg)
 	if (ax25_ctl.digi_count > AX25_MAX_DIGIS)
 		return -EINVAL;
 
+	if (ax25_ctl.arg > ULONG_MAX / HZ && ax25_ctl.cmd != AX25_KILL)
+		return -EINVAL;
+
 	digi.ndigi = ax25_ctl.digi_count;
 	for (k = 0; k < digi.ndigi; k++)
 		digi.calls[k] = ax25_ctl.digi_addr[k];
@@ -418,14 +421,10 @@ static int ax25_ctl_ioctl(const unsigned int cmd, void __user *arg)
 		break;
 
 	case AX25_T3:
-		if (ax25_ctl.arg < 0)
-			goto einval_put;
 		ax25->t3 = ax25_ctl.arg * HZ;
 		break;
 
 	case AX25_IDLE:
-		if (ax25_ctl.arg < 0)
-			goto einval_put;
 		ax25->idle = ax25_ctl.arg * 60 * HZ;
 		break;
 

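On the question of how to bound `arg': unsigned arithmetic in C wraps around, so a product that has already been computed in an unsigned long can never compare greater than ULONG_MAX. The usual idiom is to divide the limit instead of multiplying the value. A minimal sketch of that idiom, assuming `hz` stands in for the kernel's HZ and using hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a * b would not fit in an unsigned long.
 * The product itself is never computed, so nothing can wrap.
 */
static bool umul_overflows(unsigned long a, unsigned long b)
{
	return b != 0 && a > ULONG_MAX / b;
}

/*
 * Convert seconds to ticks, rejecting values that would overflow.
 * Returns 0 on success, -1 if the result does not fit.
 */
static int secs_to_ticks(unsigned long secs, unsigned long hz,
			 unsigned long *ticks)
{
	assert(ticks != NULL);

	if (umul_overflows(secs, hz))
		return -1;
	*ticks = secs * hz;
	return 0;
}
```

With this shape no wider type such as unsigned long long is needed; the comparison happens entirely within the range of unsigned long.
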
^ permalink raw reply related

* Re: PATCH: Network Device Naming mechanism and policy
From: Greg KH @ 2009-10-12 19:09 UTC (permalink / raw)
  To: Narendra K
  Cc: notting, matt_domsch, netdev, shemminger, linux-hotplug,
	jordan_hargrave, charles_rose
In-Reply-To: <20091012184711.GA4836@mock.linuxdev.us.dell.com>

On Mon, Oct 12, 2009 at 01:47:12PM -0500, Narendra K wrote:
> > On Mon, Oct 12, 2009 at 01:45:28PM -0400, Bill Nottingham wrote:
> > > Greg KH (greg@kroah.com) said: 
> > > > > Today, port naming is completely nondeterministic.  If you have 
> > > > > but one NIC, there are few chances to get the name wrong (it'll be
> > eth0).
> > > > > If you have >1 NIC, chances increase to get it wrong.
> > > > 
> > > > That is why all distros name network devices based on the only 
> > > > deterministic thing they have today, the MAC address.  I still fail 
> > > > to see why you do not like this solution, it is honestly the only 
> > > > way to properly name network devices in a sane manner.
> > > > 
> > > > All distros also provide a way to easily rename the network devices,
> > 
> > > > to place a specific name on a specific MAC address, so again, this 
> > > > should all be solved already.
> > > 
> > > No, it's not solved. Even if you have persistent names once you 
> > > install, if you ever re-image, you're likely to get *different* 
> > > persistent names; the first load will always be non-deterministic.
> > > 
> > > The only way around this would be to have some sort of screen like:
> > > 
> > >   Would you like your network devices to be enumerated by
> > > 
> > >   [ ] MAC address
> > >   [ ] PCI device order
> > >   [ ] Driver name
> > >   [ ] Other
> > 
> > [ ] PCI slot name
> > 
> > That's one that modern systems are now reporting, and should solve
> > Matt's problem as well, right?
> 
> MAC addresses and PCI slots might ensure that device names are persistent
> across system reboots. They do not ensure that LOM 1 is named
> "eth0", which is the expectation.

"LOM"?

Isn't what you want PCI slot detection, combined with the on-board
order in which the ports are enumerated?

> In case of unattended installs, installers abort installation if the
> port which gets the name "eth0" does not have the link up and doesn't
> have the IP.

Sounds like a broken installer :)

> This is often the case because the LOMs have the boot capability. We
> can achieve persistent naming using MAC addresses, but that doesn't
> address the expectation that LOM-1 becomes "eth0" on every reboot,
> which is mostly relied on for unattended installs. (Installers can be
> told to use options like IPAPPEND 2, but then this solution would be of
> no use.)

I still fail to see how this dummy char device would solve this problem,
as everything you can do today in userspace would be the same with this
device node as you can't do anything with the symlink name on its own,
right?

> > > which is just all sorts of fail in and of itself. Especially since 
> > > once you get to the point where you can coherently ask this in a 
> > > native installer, the drivers have already loaded.
> > 
> > No, the driver load order doesn't determine this, you need the drivers
> > loaded first before you can rename anything :)
> > 
> 
> Renaming an interface in the kernel namespace itself might lead to
> problems like duplicate names. Having names in an alternate namespace,
> rather than in the kernel namespace, might be more useful.

Not if you can't do anything useful with those names :)

> > And I don't see how Matt's proposed patch helps resolve this type of
> > issue any better than what we currently have today, do you?
> > 
> 
> I have a system which has 4 LOMs and 1 add-in NIC, and the add-in NIC
> always gets the name "eth0" even though I PXE booted from LOM-1. Since
> "eth0" doesn't have link up, the installer stops and asks which
> interface should get IP. This would not suit an unattended install
> scenario. If the installer can use a pathname like
> /dev/net/by-chassis-label/Embedded_NIC_1 (->eth1 which is my LOM-1), it
> would always point to the correct interface irrespective of whether it
> is "eth0" or not.

Um, again, you can name your network devices like this today, without
these symlinks...

thanks,

greg k-h

^ permalink raw reply

* Re: Ping Is Broken
From: Rob Townley @ 2009-10-12 19:14 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: CentOS mailing list, public-netdev-u79uwXL29TY76Z2rM5mHXA,
	Omaha Linux User Group
In-Reply-To: <20091012094752.GA8114@ff.dom.local>




On Mon, Oct 12, 2009 at 4:47 AM, Jarek Poplawski <jarkao2@gmail.com> wrote:
>
>
> On 09-10-2009 18:44, Rob Townley wrote:
>> ping -I is broken
>>
>> The following deals with bug in ping that made it very difficult to set up a
>> system with two gateways.
>>
>> Demonstration that *ping -I is broken*. When specifying the source
>> interface using -I with an *ethX* alias and that interface is not the
>> default gateway
>> interface, then ping fails. When specifying the interface as an ip address,
>> ping works. Search for "Destination Host Unreachable" to find the bug.
>>
>>
>> eth*0* = 4.3.2.8 and the default gateway is accessed through a different
>> interface eth*1*.
>> eth*1* = 192.168.168.155 is used as the device to get to the default
>> gateway.
>> *FAILS *: ping *-I eth0* 208.67.222.222
>> *WORKS*: ping *-I 4.3.2.8* 208.67.222.222
>> *WORKS*: ping *-I eth1* 208.67.222.222
>> *WORKS*: ping *-I 192.168.168.155* 208.67.222.222
> ...
>> man ping:
>>    -I interface address
>>         Set source address to specified interface address.
>>         Argument may be *numeric IP address or name of device*.
>>         When  pinging  IPv6  link-local  address  this option is required.
>
> It seems this description might be misleading that IP address and name
> of device are equivalent here, while they are treated a bit different.
> The device name is additionally used in a sendmsg message, probably to
> guarantee the device is really used (not its address only), so it
> looks like intended.
>
>> ping -V returns the latest available on CentOS and Fedora and the
>> maintainers website:
>> ping utility, iputils-ss020927
>
> I guess the patch below could do what you expect in this case, but
> rather "man" should be fixed...

Thank you for the patch.  i will test it. i was trying to find the
problem using gdb and figure out a patch myself.

ping used to work the way i expected many many years ago on various
*nix systems.
Besides, traceroute is broken by the same problem except that
traceroute is much more explicit with a -i and -s parameters.  Who
knows what else is broken by all the meddling in interface name
aliases without testing.

MultiNic / MultiGatewayed machines are hard enough in Linux; let's not
give users a reason to use BSD or Windows.

>
> Jarek P.
> ---
>
> --- ping.c.orig 2002-09-20 15:08:11.000000000 +0000
> +++ ping.c      2009-10-12 08:51:25.000000000 +0000
> @@ -323,7 +323,7 @@ main(int argc, char **argv)
>                perror("ping: icmp open socket");
>                exit(2);
>        }
> -
> +#if 0
>        if (device) {
>                struct ifreq ifr;
>
> @@ -336,7 +336,7 @@ main(int argc, char **argv)
>                cmsg.ipi.ipi_ifindex = ifr.ifr_ifindex;
>                cmsg_len = sizeof(cmsg);
>        }
> -
> +#endif
>        if (broadcast_pings || IN_MULTICAST(ntohl(whereto.sin_addr.s_addr))) {
>                if (uid) {
>                        if (interval < 1000) {
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
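
Jarek's explanation above, that a device name given to -I is additionally carried in the sendmsg() message, can be illustrated with a sketch of how an interface index is attached as IP_PKTINFO ancillary data. This is an assumption-laden illustration rather than ping's actual code: `set_out_ifindex` is a hypothetical helper, and the caller is assumed to have looked up the index already (for example with SIOCGIFINDEX, as the quoted hunk does).

```c
#define _GNU_SOURCE	/* struct in_pktinfo on glibc */
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Attach an IP_PKTINFO control message naming the outgoing interface
 * (by index) to msg. cbuf must hold at least
 * CMSG_SPACE(sizeof(struct in_pktinfo)) bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int set_out_ifindex(struct msghdr *msg, void *cbuf, size_t cbuflen,
			   int ifindex)
{
	struct in_pktinfo info;
	struct cmsghdr *cmsg;

	assert(msg != NULL && cbuf != NULL);

	if (cbuflen < CMSG_SPACE(sizeof(info)))
		return -1;

	memset(cbuf, 0, cbuflen);
	msg->msg_control = cbuf;
	msg->msg_controllen = CMSG_SPACE(sizeof(info));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_PKTINFO;
	cmsg->cmsg_len = CMSG_LEN(sizeof(info));

	memset(&info, 0, sizeof(info));
	info.ipi_ifindex = ifindex;	/* ask for this device on transmit */
	memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
	return 0;
}
```

When only a source address is given, no such control message is built and the routing table alone picks the outgoing device, which would be consistent with the different results reported for "-I eth0" and "-I 4.3.2.8".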



^ permalink raw reply

* [PATCH 1/3] mdio: Advertise pause (flow control) settings even if autoneg is off
From: Ben Hutchings @ 2009-10-12 19:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

Currently, if pause autoneg is off we do not set either pause
advertising flag.  If autonegotiation of speed and duplex settings is
enabled, there is no way for the link partner to distinguish this from
our refusing to use pause frames.

We should instead set the advertising flags according to the forced
mode so that the link partner can follow our lead.  This is consistent
with the behaviour of other drivers.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/mdio.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mdio.c b/drivers/net/mdio.c
index 21f8754..c0db9d7 100644
--- a/drivers/net/mdio.c
+++ b/drivers/net/mdio.c
@@ -344,11 +344,9 @@ void mdio45_ethtool_spauseparam_an(const struct mdio_if_info *mdio,
 
 	old_adv = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_AN,
 				  MDIO_AN_ADVERTISE);
-	adv = old_adv & ~(ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM);
-	if (ecmd->autoneg)
-		adv |= mii_advertise_flowctrl(
-			(ecmd->rx_pause ? FLOW_CTRL_RX : 0) |
-			(ecmd->tx_pause ? FLOW_CTRL_TX : 0));
+	adv = ((old_adv & ~(ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM)) |
+	       mii_advertise_flowctrl((ecmd->rx_pause ? FLOW_CTRL_RX : 0) |
+				      (ecmd->tx_pause ? FLOW_CTRL_TX : 0)));
 	if (adv != old_adv) {
 		mdio->mdio_write(mdio->dev, mdio->prtad, MDIO_MMD_AN,
 				 MDIO_AN_ADVERTISE, adv);

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
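
A minimal userspace sketch of the encoding the patch above relies on,
for readers who do not have the kernel tree at hand.  The constant
values are copied from what I believe <linux/mii.h> defines, and
advertise_flowctrl() is a hypothetical stand-in for the kernel's
mii_advertise_flowctrl(); pause_adv_update() is likewise a made-up name
for the read-modify-write step, not a kernel function.

```c
#include <stdint.h>

/* Assumed to match <linux/mii.h>; copied so the sketch builds in userspace. */
#define ADVERTISE_PAUSE_CAP  0x0400	/* symmetric pause */
#define ADVERTISE_PAUSE_ASYM 0x0800	/* asymmetric pause */
#define FLOW_CTRL_TX         0x01
#define FLOW_CTRL_RX         0x02

/* Encode a forced RX/TX pause mode as autoneg advertising bits. */
uint16_t advertise_flowctrl(int cap)
{
	uint16_t adv = 0;

	if (cap & FLOW_CTRL_RX)
		adv = ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
	if (cap & FLOW_CTRL_TX)
		adv ^= ADVERTISE_PAUSE_ASYM;
	return adv;
}

/*
 * Hypothetical helper: clear the two pause bits in old_adv and set them
 * from the requested mode, regardless of whether pause autoneg is on.
 */
uint16_t pause_adv_update(uint16_t old_adv, int rx_pause, int tx_pause)
{
	int cap = (rx_pause ? FLOW_CTRL_RX : 0) |
		  (tx_pause ? FLOW_CTRL_TX : 0);

	return (uint16_t)((old_adv & ~(ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM)) |
			  advertise_flowctrl(cap));
}
```

A caller would compare the result with old_adv and only write the
register (and restart autonegotiation) when the two differ, as the
patch does.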


^ permalink raw reply related

* [PATCH 2/3] mdio: Expose pause frame advertising flags to ethtool
From: Ben Hutchings @ 2009-10-12 19:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

In mdio45_ethtool_gset_npage() and mdio45_ethtool_gset(), check MDIO
pause frame advertising flags and set the corresponding ethtool flags.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/mdio.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/mdio.c b/drivers/net/mdio.c
index c0db9d7..e85bf04 100644
--- a/drivers/net/mdio.c
+++ b/drivers/net/mdio.c
@@ -162,6 +162,10 @@ static u32 mdio45_get_an(const struct mdio_if_info *mdio, u16 addr)
 		result |= ADVERTISED_100baseT_Half;
 	if (reg & ADVERTISE_100FULL)
 		result |= ADVERTISED_100baseT_Full;
+	if (reg & ADVERTISE_PAUSE_CAP)
+		result |= ADVERTISED_Pause;
+	if (reg & ADVERTISE_PAUSE_ASYM)
+		result |= ADVERTISED_Asym_Pause;
 	return result;
 }
 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
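
As a rough illustration of the translation this patch extends, here is
a trimmed-down userspace sketch.  an_reg_to_ethtool() is a hypothetical
name, not the kernel's mdio45_get_an(), and it handles only the four
bits shown in the hunk above; the constant values are what I believe
<linux/mii.h> and <linux/ethtool.h> define.

```c
#include <stdint.h>

/* MII advertisement register bits (assumed to match <linux/mii.h>). */
#define ADVERTISE_100HALF    0x0080
#define ADVERTISE_100FULL    0x0100
#define ADVERTISE_PAUSE_CAP  0x0400
#define ADVERTISE_PAUSE_ASYM 0x0800

/* ethtool advertising flags (assumed to match <linux/ethtool.h>). */
#define ADVERTISED_100baseT_Half (1u << 2)
#define ADVERTISED_100baseT_Full (1u << 3)
#define ADVERTISED_Pause         (1u << 13)
#define ADVERTISED_Asym_Pause    (1u << 14)

/* Translate selected advertisement register bits to ethtool flags. */
uint32_t an_reg_to_ethtool(uint16_t reg)
{
	uint32_t result = 0;

	if (reg & ADVERTISE_100HALF)
		result |= ADVERTISED_100baseT_Half;
	if (reg & ADVERTISE_100FULL)
		result |= ADVERTISED_100baseT_Full;
	if (reg & ADVERTISE_PAUSE_CAP)
		result |= ADVERTISED_Pause;
	if (reg & ADVERTISE_PAUSE_ASYM)
		result |= ADVERTISED_Asym_Pause;
	return result;
}
```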


^ permalink raw reply related

* [PATCH 3/3] sfc: 10Xpress: Initialise pause advertising flags
From: Ben Hutchings @ 2009-10-12 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

The mdio module now handles reconfiguration of pause advertising
through ethtool, but not initialisation.  Add the necessary
initialisation to tenxpress_phy_init().

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/tenxpress.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/sfc/tenxpress.c b/drivers/net/sfc/tenxpress.c
index f4d5090..1a3495c 100644
--- a/drivers/net/sfc/tenxpress.c
+++ b/drivers/net/sfc/tenxpress.c
@@ -301,6 +301,7 @@ static int tenxpress_init(struct efx_nic *efx)
 static int tenxpress_phy_init(struct efx_nic *efx)
 {
 	struct tenxpress_phy_data *phy_data;
+	u16 old_adv, adv;
 	int rc = 0;
 
 	phy_data = kzalloc(sizeof(*phy_data), GFP_KERNEL);
@@ -333,6 +334,15 @@ static int tenxpress_phy_init(struct efx_nic *efx)
 	if (rc < 0)
 		goto fail;
 
+	/* Set pause advertising */
+	old_adv = efx_mdio_read(efx, MDIO_MMD_AN, MDIO_AN_ADVERTISE);
+	adv = ((old_adv & ~(ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM)) |
+	       mii_advertise_flowctrl(efx->wanted_fc));
+	if (adv != old_adv) {
+		efx_mdio_write(efx, MDIO_MMD_AN, MDIO_AN_ADVERTISE, adv);
+		mdio45_nway_restart(&efx->mdio);
+	}
+
 	if (efx->phy_type == PHY_TYPE_SFT9001B) {
 		rc = device_create_file(&efx->pci_dev->dev,
 					&dev_attr_phy_short_reach);

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: PATCH: Network Device Naming mechanism and policy
From: Karl O. Pinc @ 2009-10-12 19:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Narendra K, notting, matt_domsch, netdev, shemminger,
	linux-hotplug, jordan_hargrave, charles_rose
In-Reply-To: <20091012190900.GA514@kroah.com>

On 10/12/2009 02:09:00 PM, Greg KH wrote:

> "LOM"?

"LAN On Motherboard" of all things.  

(I had to look this one up.  The expansion better suited to today's
economy is "Low On Manna".)


Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


^ permalink raw reply

* [PATCH] Fix IXP 2000 network driver building.
From: Vincent Sanders @ 2009-10-12 19:46 UTC (permalink / raw)
  To: netdev; +Cc: Vincent Sanders

The IXP 2000 network driver was failing to build as it has its own
statistics gathering which was not compatible with the recent network
device operations changes. This patch fixes the driver in the obvious
way and has been compile tested. I have been unable to get the ixp2000
maintainer to comment or test this fix.

Signed-off-by: Vincent Sanders <vince@simtec.co.uk>
---
 drivers/net/ixp2000/enp2611.c |   18 +-----------------
 drivers/net/ixp2000/ixpdev.c  |   11 +++++++++++
 2 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ixp2000/enp2611.c b/drivers/net/ixp2000/enp2611.c
index b02a981..34a6cfd 100644
--- a/drivers/net/ixp2000/enp2611.c
+++ b/drivers/net/ixp2000/enp2611.c
@@ -119,24 +119,9 @@ static struct ixp2400_msf_parameters enp2611_msf_parameters =
 	}
 };
 
-struct enp2611_ixpdev_priv
-{
-	struct ixpdev_priv		ixpdev_priv;
-	struct net_device_stats		stats;
-};
-
 static struct net_device *nds[3];
 static struct timer_list link_check_timer;
 
-static struct net_device_stats *enp2611_get_stats(struct net_device *dev)
-{
-	struct enp2611_ixpdev_priv *ip = netdev_priv(dev);
-
-	pm3386_get_stats(ip->ixpdev_priv.channel, &(ip->stats));
-
-	return &(ip->stats);
-}
-
 /* @@@ Poll the SFP moddef0 line too.  */
 /* @@@ Try to use the pm3386 DOOL interrupt as well.  */
 static void enp2611_check_link_status(unsigned long __dummy)
@@ -203,14 +188,13 @@ static int __init enp2611_init_module(void)
 
 	ports = pm3386_port_count();
 	for (i = 0; i < ports; i++) {
-		nds[i] = ixpdev_alloc(i, sizeof(struct enp2611_ixpdev_priv));
+		nds[i] = ixpdev_alloc(i, sizeof(struct ixpdev_priv));
 		if (nds[i] == NULL) {
 			while (--i >= 0)
 				free_netdev(nds[i]);
 			return -ENOMEM;
 		}
 
-		nds[i]->get_stats = enp2611_get_stats;
 		pm3386_init_port(i);
 		pm3386_get_mac(i, nds[i]->dev_addr);
 	}
diff --git a/drivers/net/ixp2000/ixpdev.c b/drivers/net/ixp2000/ixpdev.c
index 1272434..9aee0cc 100644
--- a/drivers/net/ixp2000/ixpdev.c
+++ b/drivers/net/ixp2000/ixpdev.c
@@ -21,6 +21,7 @@
 #include "ixp2400_tx.ucode"
 #include "ixpdev_priv.h"
 #include "ixpdev.h"
+#include "pm3386.h"
 
 #define DRV_MODULE_VERSION	"0.2"
 
@@ -271,6 +272,15 @@ static int ixpdev_close(struct net_device *dev)
 	return 0;
 }
 
+static struct net_device_stats *ixpdev_get_stats(struct net_device *dev)
+{
+	struct ixpdev_priv *ip = netdev_priv(dev);
+
+	pm3386_get_stats(ip->channel, &(dev->stats));
+
+	return &(dev->stats);
+}
+
 static const struct net_device_ops ixpdev_netdev_ops = {
 	.ndo_open		= ixpdev_open,
 	.ndo_stop		= ixpdev_close,
@@ -278,6 +288,7 @@ static const struct net_device_ops ixpdev_netdev_ops = {
 	.ndo_change_mtu		= eth_change_mtu,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_get_stats		= ixpdev_get_stats,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= ixpdev_poll_controller,
 #endif
-- 
1.6.0.4


^ permalink raw reply related

* Re: PATCH: Network Device Naming mechanism and policy
From: Matt Domsch @ 2009-10-12 19:48 UTC (permalink / raw)
  To: Greg KH
  Cc: Narendra K, notting, netdev, shemminger, linux-hotplug,
	jordan_hargrave, charles_rose
In-Reply-To: <20091012190900.GA514@kroah.com>

On Mon, Oct 12, 2009 at 12:09:00PM -0700, Greg KH wrote:
> "LOM"?

LAN on Motherboard (e.g. an embedded NIC, as opposed to being in some
slot).

> Isn't what you want is a PCI slot detection, combined with the order on
> board in which the port is enumerated?

Most folks do, yes.

> I still fail to see how this dummy char device would solve this problem,
> as everything you can do today in userspace would be the same with this
> device node as you can't do anything with the symlink name on its own,
> right?

You are correct, the char device by itself doesn't help with this.
As you noted earlier, the char device is really only needed if we want
to be able to have multiple names for the same device, exposed only in
userspace.

If all we want to do is change the namespace for devices the kernel
uses, from "ethN" to something else, we can do that with a single
simple rename.  And biosdevname has several --policy=[] options to
provide that.

--policy=smbios_names => "Embedded NIC 1", "PCI2"
--policy=kernelnames  => "eth0"  (kind of pointless, but included for completeness)
--policy=all_ethN     => "eth0..ethN" in ascending slot order, embedded
                         before slots, within a single slot in PCI
                         breadth-first order, and thereafter in MAC
                         address order if really needed.
--policy=all_names    => "eth_s0_0" for the first embedded NIC in PCI
                         breadth-first order, "eth_s1_1" for the
                         second NIC port in PCI slot 1, again in
                         breadth-first order.
--policy=embedded_ethN_slots_names   a combo of the above, but making
                        the embeddeds still retain the "eth0" format
                        and the slots get "eth_s1_1" format.

We could add a dozen more.  all_ethN, and to a lesser extent,
embedded_ethN, are bad choices if biosdevname is invoked by udev on
every run (e.g. not using persistent rules), because when it's run,
userspace doesn't know if there are more drivers to be loaded yet, and
so biosdevname can't know if there are more NICs to
include in the enumeration, to get the naming right.  (yet another
example of enumeration != naming).

Now, for --policy=smbios_names, we get lucky in that the string
returned from SMBIOS is 14 characters long, so it fits in IFNAMSIZ.  We
may not always get so lucky, as SMBIOS strings are of arbitrary length.
This works somewhat better as a symlink source, as that can be longer
than IFNAMSIZ.

--policy=all_names is pretty good.  It fits, and it lines up with a
fairly obvious hardware mapping.  However, it breaks any code that
assumes a regular expression eth[[:digit:]]+ for the name.
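
The kind of application assumption that breaks can be sketched with
POSIX regcomp()/regexec().  looks_like_ethN() is a hypothetical example
of such application code, anchored at both ends; it is not taken from
any particular program.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if name matches ^eth[[:digit:]]+$, 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
int looks_like_ethN(const char *name)
{
	regex_t re;
	int rc;

	if (regcomp(&re, "^eth[[:digit:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, name, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Code written this way accepts "eth0" but silently skips a device named
"eth_s1_1".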

By having a single name in the kernel for a particular device, it
forces a sysadmin to choose one naming policy.  We can't have multiple
names for the same device (like we do for disks).  And conceptually,
I'd like to be able to have a physical-based naming scheme (all_names)
for use at installtime and mechanical configuration, and a
logical-based naming scheme for firewall rules and other policy-based
configuration.  I can't do that with a single name.

I can't please everyone.  I can't keep the kernel's eth* namespace
intact, as it is meaningless and non-deterministic.  I can switch
names to another namespace, at the risk of breaking all the
applications that have bad assumptions.  And today I can't have
multiple names for the same device.  But if I could have multiple
names for the same device, then I could keep the eth* namespace intact
(meaningless as it is), and provide more meaningful names that work too.

I'm not hung up on the char device.  If I could have multiple names
for the same device, done entirely inside the kernel, I'd go for that
too.  That suggestion has met similar resistance.  I'm open to any
other mechanism as well.  But _not_ solving it is no longer an
option for me and my customers.

-- 
Matt Domsch
Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux

^ permalink raw reply

* Re: 2.6.32-rc4: Reported regressions 2.6.30 -> 2.6.31
From: Andrew Patterson @ 2009-10-12 19:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Andrew Morton, Linus Torvalds,
	Natalie Protasevich, Kernel Testers List, Network Development,
	Linux ACPI, Linux PM List, Linux SCSI List, Linux Wireless List,
	DRI
In-Reply-To: <56acieJJ2fF.A.nEB.Hzl0KB@chimera>

On Mon, 2009-10-12 at 00:41 +0200, Rafael J. Wysocki wrote:
> 
> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14309
> Subject		: MCA on hp rx8640
> Submitter	: Andrew Patterson <andrew.patterson@hp.com>
> Date		: 2009-09-29 17:20 (13 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=db8be50c4307dac2b37305fc59c8dc0f978d09ea
> References	: http://www.spinics.net/lists/linux-usb/msg22799.html
> 

Linus fixed this one with d93a8f829fe1d2f3002f2c6ddb553d12db420412.  It
also looks like a duplicate of
http://bugzilla.kernel.org/show_bug.cgi?id=14374

Thanks,

Andrew
-- 
Andrew Patterson
Hewlett-Packard


^ permalink raw reply

* Re: Ping Is Broken
From: Brian Haley @ 2009-10-12 20:36 UTC (permalink / raw)
  To: Rob.Townley; +Cc: netdev, Omaha Linux User Group, CentOS mailing list
In-Reply-To: <7e84ed60910090944q5c66ea0w63ed55a72482bf2f@mail.gmail.com>

Rob Townley wrote:
> ping -I is broken
> 
> The following deals with bug in ping that made it very difficult to set up a
> system with two gateways.
> 
> Demonstration that *ping -I is broken*. When specifying the source
> interface using -I with an *ethX* alias and that interface is not the
> default gateway
> interface, then ping fails. When specifying the interface as an ip address,
> ping works. Search for "Destination Host Unreachable" to find the bug.

I believe ping is working properly here, see below.

> eth*0* = 4.3.2.8 and the default gateway is accessed through a different
> interface eth*1*.
> eth*1* = 192.168.168.155 is used as the device to get to the default
> gateway.
> *FAILS *: ping *-I eth0* 208.67.222.222
> *WORKS*: ping *-I 4.3.2.8* 208.67.222.222
> *WORKS*: ping *-I eth1* 208.67.222.222
> *WORKS*: ping *-I 192.168.168.155* 208.67.222.222
> 
> The following are actual results which can be reproduced from an up-to-date
> Fedora 11 or CentOS 5.3 box. Caused a very very long episode of frustration
> when setting up multi gatewayed systems.
> 
> 
> * ping using eth0 *:
> 
> ping -c 2 -B -I  eth0 208.67.222.222
> PING 208.67.222.222 (208.67.222.222) from 4.3.2.8 eth0: 56(84) bytes of data.
> From 4.3.2.8 icmp_seq=1 Destination Host Unreachable
> From 4.3.2.8 icmp_seq=2 Destination Host Unreachable

In this case ping is doing an SO_BINDTODEVICE to eth0, so the kernel is going
to force the packets out of it, even if it isn't the "correct" interface.  If
you ran tcpdump you'd probably see an ARP resolution failure, or an ICMP from
a gateway.

This confusion could be cleared up in the man page.  What did you expect to
happen in this case?

> The Following all WORK:
> * ping using 4.3.2.8 *:
> 
> ping -c 2 -B -I  4.3.2.8 208.67.222.222
> PING 208.67.222.222 (208.67.222.222) from 4.3.2.8 : 56(84) bytes of data.
> 64 bytes from 208.67.222.222: icmp_seq=1 ttl=55 time=562 ms
> 64 bytes from 208.67.222.222: icmp_seq=2 ttl=55 time=642 ms

In this case ping is going to bind() the source address to 4.3.2.8, but not
restrict the interface at all.  It works because of the weak end-host model
of Linux where that address can be used on any interface, not just the one
it is configured on.

Your other two examples are similar to this one.

-Brian

^ permalink raw reply

* Re: Ping Is Broken
From: Jarek Poplawski @ 2009-10-12 20:43 UTC (permalink / raw)
  To: Rob Townley
  Cc: CentOS mailing list, public-netdev-u79uwXL29TY76Z2rM5mHXA,
	Omaha Linux User Group
In-Reply-To: <7e84ed60910121214n71413383v3ee703ea6042f355@mail.gmail.com>



On Mon, Oct 12, 2009 at 02:14:13PM -0500, Rob Townley wrote:
> On Mon, Oct 12, 2009 at 4:47 AM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> >
> >
> > On 09-10-2009 18:44, Rob Townley wrote:
> >> ping -I is broken
> >>
> >> The following deals with bug in ping that made it very difficult to set up a
> >> system with two gateways.
> >>
> >> Demonstration that *ping -I is broken*. When specifying the source
> >> interface using -I with an *ethX* alias and that interface is not the
> >> default gateway
> >> interface, then ping fails. When specifying the interface as an ip address,
> >> ping works. Search for "Destination Host Unreachable" to find the bug.
> >>
> >>
> >> eth*0* = 4.3.2.8 and the default gateway is accessed through a different
> >> interface eth*1*.
> >> eth*1* = 192.168.168.155 is used as the device to get to the default
> >> gateway.
> >> *FAILS *: ping *-I eth0* 208.67.222.222
> >> *WORKS*: ping *-I 4.3.2.8* 208.67.222.222
> >> *WORKS*: ping *-I eth1* 208.67.222.222
> >> *WORKS*: ping *-I 192.168.168.155* 208.67.222.222
> > ...
> >> man ping:
> >>    -I interface address
> >>         Set source address to specified interface address.
> >>         Argument may be *numeric IP address or name of device*.
> >>         When  pinging  IPv6  link-local  address  this option is required.
> >
> > It seems this description might be misleading that IP address and name
> > of device are equivalent here, while they are treated a bit different.
> > The device name is additionally used in a sendmsg message, probably to
> > guarantee the device is really used (not its address only), so it
> > looks like intended.
> >
> >> ping -V returns the latest available on CentOS and Fedora and the
> >> maintainers website:
> >> ping utility, iputils-ss020927
> >
> > I guess the patch below could do what you expect in this case, but
> > rather "man" should be fixed...
> 
> Thank you for the patch.  i will test it. i was trying to find the
> problem using gdb and figure out a patch myself.
> 
> ping used to work the way i expected many many years ago on various
> *nix systems.

This patch is rather to show the main difference a device name could
make here. IMHO it should work in your case (I didn't test it), but
as a matter of fact I'm not sure it's the way (route) you expected.

> Besides, traceroute is broken by the same problem except that
> traceroute is much more explicit with a -i and -s parameters.  Who
> knows what else is broken by all the meddling in interface name
> aliases without testing.
> 
> MultiNic / MultiGatewayed machines are hard enough in Linux, let's not
> give users a reason to use BSD or Windows.

Linux routing, especially multipath, might be simply different than
others, but I wouldn't call it broken (except when it's broken ;-).
In this case I don't think it's proven enough: if you change the
default route to eth0's in your example it should show there is some
consistency in it.

It seems "-I eth0" is meant to mean something different from
"-I ip_address" (where it can matter), and ping behaves accordingly.
It's just not documented well enough.

Jarek P.


^ permalink raw reply
