All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
@ 2012-01-27 12:36 David Vrabel
  2012-01-27 21:53 ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: David Vrabel @ 2012-01-27 12:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: Greg Rose, David S. Miller

Changeset c7ac8679bec9397afe8918f788cbcef88c38da54 (rtnetlink: Compute
and store minimum ifinfo dump size) applied to 3.1 increased the maximum
size of the RTM_GETLINK message response.

glibc's getifaddrs() function uses a page sized (4 KiB) buffer for the
RTM_GETLINK response and returns a failure if the message is truncated.
This buffer is not large enough if there is a network card with many
virtual functions.

What do you recommend to resolve this regression?

David

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
  2012-01-27 12:36 Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs() David Vrabel
@ 2012-01-27 21:53 ` David Miller
  2012-01-27 22:26   ` Štefan Gula
  2012-01-30 16:50   ` Rose, Gregory V
  0 siblings, 2 replies; 6+ messages in thread
From: David Miller @ 2012-01-27 21:53 UTC (permalink / raw)
  To: david.vrabel; +Cc: netdev, gregory.v.rose, steweg

From: David Vrabel <david.vrabel@citrix.com>
Date: Fri, 27 Jan 2012 12:36:47 +0000

> Changeset c7ac8679bec9397afe8918f788cbcef88c38da54 (rtnetlink: Compute
> and store minimum ifinfo dump size) applied to 3.1 increased the maximum
> size of the RTM_GETLINK message response.
> 
> glibc's getifaddrs() function uses a page sized (4 KiB) buffer for the
> RTM_GETLINK response and returns a failure if the message is truncated.
> This buffer is not large enough if there is a network card with many
> virtual functions.
> 
> What do you recommend to resolve this regression?

Actually, glibc technically uses the CPP define value "PAGE_SIZE" if
available, which is potentially different from the system page size.

Using a statically defined PAGE_SIZE is wrong if the program is
subsequently executed on a system with a different page size.

On sparc, and powerpc I believe, this happens commonly.  A 32-bit
executable will see a PAGE_SIZE value of 4K, but when executed on a
64-bit system the page size is actually 8K or larger.  That's why
__getpagesize() should always be used.

Anyways, if the page sizes are correct we're in a bit of a pickle.

Do you have any idea what the computed value of min_ifinfo_dump_size
is at the time of the failure?

Greg, I think we're kind of screwed.  The defined minimum appropriate
buffer size for a recvmsg() call on a netlink socket is defined as:

	getpagesize() < 8192 ? getpagesize() : 8192

and glibc essentially abides by this by unconditionally using page
size, and therefore if an interface with many virtual interfaces takes
us over this limit, we break basically every properly written piece of
netlink code out there.

iproute2 is funny, it uses a static 16K buffer size for netlink
message reception, which tends to paper over this problem :-/

Stefan, this plays into your work too, we're starting to create
situations which are going to cause very serious problems.  Imagine
an interface with lots of these new macvlan source rules, it would
potentially exceed the limit in the above equation as well.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
  2012-01-27 21:53 ` David Miller
@ 2012-01-27 22:26   ` Štefan Gula
  2012-01-30 16:50   ` Rose, Gregory V
  1 sibling, 0 replies; 6+ messages in thread
From: Štefan Gula @ 2012-01-27 22:26 UTC (permalink / raw)
  To: David Miller; +Cc: david.vrabel, netdev, gregory.v.rose

2012/1/27 David Miller <davem@davemloft.net>:
> From: David Vrabel <david.vrabel@citrix.com>
> Date: Fri, 27 Jan 2012 12:36:47 +0000
>
>> Changeset c7ac8679bec9397afe8918f788cbcef88c38da54 (rtnetlink: Compute
>> and store minimum ifinfo dump size) applied to 3.1 increased the maximum
>> size of the RTM_GETLINK message response.
>>
>> glibc's getifaddrs() function uses a page sized (4 KiB) buffer for the
>> RTM_GETLINK response and returns a failure if the message is truncated.
>> This buffer is not large enough if there is a network card with many
>> virtual functions.
>>
>> What do you recommend to resolve this regression?
>
> Actually, glibc technically uses the CPP define value "PAGE_SIZE" if
> available, which is potentially different from the system page size.
>
> Using a statically defined PAGE_SIZE is wrong if the program is
> subsequently executed on a system with a different page size.
>
> On sparc, and powerpc I believe, this happens commonly.  A 32-bit
> executable will see a PAGE_SIZE value of 4K, but when executed on a
> 64-bit system the page size is actually 8K or larger.  That's why
> __getpagesize() should always be used.
>
> Anyways, if the page sizes are correct we're in a bit of a pickle.
>
> Do you have any idea what the computed value of min_ifinfo_dump_size
> is at the time of the failure?
>
> Greg, I think we're kind of screwed.  The defined minimum appropriate
> buffer size for a recvmsg() call on a netlink socket is defined as:
>
>        getpagesize() < 8192 ? getpagesize() : 8192
>
> and glibc essentially abides by this by unconditionally using page
> size, and therefore if an interface with many virtual interfaces takes
> us over this limit, we break basically every properly written piece of
> netlink code out there.
>
> iproute2 is funny, it uses a static 16K buffer size for netlink
> message reception, which tends to paper over this problem :-/
>
> Stefan, this plays into your work too, we're starting to create
> situations which are going to cause very serious problems.  Imagine
> an interface with lots of these new macvlan source rules, it would
> potentially exceed the limit in the above equation as well.
>
> Thanks.
I can theoretically avoid this situation by using multiple small
rtnetlink messages instead of one huge. I believe that ipset code is
using that approach - just have to figure out proper changes to my
macvlan code. But this doesn't solves the issue you are mentioning, it
only prohibits to happen for macvlan code as one page size will be
more than enough to allocate buffer....

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
  2012-01-27 21:53 ` David Miller
  2012-01-27 22:26   ` Štefan Gula
@ 2012-01-30 16:50   ` Rose, Gregory V
  2012-01-30 16:58     ` David Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Rose, Gregory V @ 2012-01-30 16:50 UTC (permalink / raw)
  To: David Miller, david.vrabel@citrix.com
  Cc: netdev@vger.kernel.org, steweg@gmail.com

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Friday, January 27, 2012 1:54 PM
> To: david.vrabel@citrix.com
> Cc: netdev@vger.kernel.org; Rose, Gregory V; steweg@gmail.com
> Subject: Re: Regression: "rtnetlink: Compute and store minimum ifinfo dump
> size" breaks glibc's getifaddrs()
> 
> From: David Vrabel <david.vrabel@citrix.com>
> Date: Fri, 27 Jan 2012 12:36:47 +0000
> 
> > Changeset c7ac8679bec9397afe8918f788cbcef88c38da54 (rtnetlink: Compute
> > and store minimum ifinfo dump size) applied to 3.1 increased the maximum
> > size of the RTM_GETLINK message response.
> >
> > glibc's getifaddrs() function uses a page sized (4 KiB) buffer for the
> > RTM_GETLINK response and returns a failure if the message is truncated.
> > This buffer is not large enough if there is a network card with many
> > virtual functions.
> >
> > What do you recommend to resolve this regression?
> 
> Actually, glibc technically uses the CPP define value "PAGE_SIZE" if
> available, which is potentially different from the system page size.
> 
> Using a statically defined PAGE_SIZE is wrong if the program is
> subsequently executed on a system with a different page size.
> 
> On sparc, and powerpc I believe, this happens commonly.  A 32-bit
> executable will see a PAGE_SIZE value of 4K, but when executed on a
> 64-bit system the page size is actually 8K or larger.  That's why
> __getpagesize() should always be used.
> 
> Anyways, if the page sizes are correct we're in a bit of a pickle.
> 
> Do you have any idea what the computed value of min_ifinfo_dump_size
> is at the time of the failure?
> 
> Greg, I think we're kind of screwed.  The defined minimum appropriate
> buffer size for a recvmsg() call on a netlink socket is defined as:
> 
> 	getpagesize() < 8192 ? getpagesize() : 8192
> 
> and glibc essentially abides by this by unconditionally using page
> size, and therefore if an interface with many virtual interfaces takes
> us over this limit, we break basically every properly written piece of
> netlink code out there.

Yep, we're screwed.  And as more features and capabilities get added to virtual functions that we want to control through netlink it gets even worse.

Maybe we should have 'ip link show' just display the number of VFs and then have a new 'ip' tool syntax along the lines of 'ip link show eth(x) vf (n)' where eth(x) is the PF and n is the number of the VF.  Then it could show all relevant information for just that VF.  Scripts could parse the number of VFs from the first call to 'ip link show' and then loop to show the details of each VF.

Just an idea... maybe there are other ones out there but it's just getting ridiculous how much data has to be transferred back and forth during the basic 'ip link show' command when the interface as subordinate VFs now that we're getting to devices with up to 256 VFs.

- Greg

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
  2012-01-30 16:50   ` Rose, Gregory V
@ 2012-01-30 16:58     ` David Miller
  2012-01-30 17:02       ` Rose, Gregory V
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2012-01-30 16:58 UTC (permalink / raw)
  To: gregory.v.rose; +Cc: david.vrabel, netdev, steweg

From: "Rose, Gregory V" <gregory.v.rose@intel.com>
Date: Mon, 30 Jan 2012 16:50:58 +0000

> Maybe we should have 'ip link show' just display the number of VFs
> and then have a new 'ip' tool syntax along the lines of 'ip link
> show eth(x) vf (n)' where eth(x) is the PF and n is the number of
> the VF.  Then it could show all relevant information for just that
> VF.  Scripts could parse the number of VFs from the first call to
> 'ip link show' and then loop to show the details of each VF.
> 
> Just an idea... maybe there are other ones out there but it's just
> getting ridiculous how much data has to be transferred back and
> forth during the basic 'ip link show' command when the interface as
> subordinate VFs now that we're getting to devices with up to 256
> VFs.

The fix is almost certainly that we need to require that the application
turn on the publishing of features it is interested in.

The inet_diag sockets do something like this, wherein you set bits in
some flags of the request saying what attributes you want to see.

It seems clear that we must, at this point, turn off VF attributes by
default.

Could you do some work on this?  If you don't have the time I'll look
into it, because we'll need to backport this to -stable too.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs()
  2012-01-30 16:58     ` David Miller
@ 2012-01-30 17:02       ` Rose, Gregory V
  0 siblings, 0 replies; 6+ messages in thread
From: Rose, Gregory V @ 2012-01-30 17:02 UTC (permalink / raw)
  To: David Miller
  Cc: david.vrabel@citrix.com, netdev@vger.kernel.org, steweg@gmail.com

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Monday, January 30, 2012 8:58 AM
> To: Rose, Gregory V
> Cc: david.vrabel@citrix.com; netdev@vger.kernel.org; steweg@gmail.com
> Subject: Re: Regression: "rtnetlink: Compute and store minimum ifinfo dump
> size" breaks glibc's getifaddrs()
> 
> From: "Rose, Gregory V" <gregory.v.rose@intel.com>
> Date: Mon, 30 Jan 2012 16:50:58 +0000
> 
> > Maybe we should have 'ip link show' just display the number of VFs
> > and then have a new 'ip' tool syntax along the lines of 'ip link
> > show eth(x) vf (n)' where eth(x) is the PF and n is the number of
> > the VF.  Then it could show all relevant information for just that
> > VF.  Scripts could parse the number of VFs from the first call to
> > 'ip link show' and then loop to show the details of each VF.
> >
> > Just an idea... maybe there are other ones out there but it's just
> > getting ridiculous how much data has to be transferred back and
> > forth during the basic 'ip link show' command when the interface as
> > subordinate VFs now that we're getting to devices with up to 256
> > VFs.
> 
> The fix is almost certainly that we need to require that the application
> turn on the publishing of features it is interested in.
> 
> The inet_diag sockets do something like this, wherein you set bits in
> some flags of the request saying what attributes you want to see.
> 
> It seems clear that we must, at this point, turn off VF attributes by
> default.
> 
> Could you do some work on this?  If you don't have the time I'll look
> into it, because we'll need to backport this to -stable too.

Yeah, I can push out some other things I'm working on to get on this.

- Greg

> 
> Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-01-30 17:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-27 12:36 Regression: "rtnetlink: Compute and store minimum ifinfo dump size" breaks glibc's getifaddrs() David Vrabel
2012-01-27 21:53 ` David Miller
2012-01-27 22:26   ` Štefan Gula
2012-01-30 16:50   ` Rose, Gregory V
2012-01-30 16:58     ` David Miller
2012-01-30 17:02       ` Rose, Gregory V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.