Netdev List


* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 19:38 UTC
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <1305055948.2437.13.camel@edumazet-laptop>

On Tue, 10 May 2011, Eric Dumazet wrote:

> > > You have to disable IRQ _before_ even fetching 'object'
> >
> > The object pointer is being obtained from a per cpu structure and
> > not from the page. What is the problem with fetching the object pointer?
> >
> > > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> >
> > If that occurs then TID is being incremented and we will restart the loop
> > getting the new object pointer from the per cpu structure. The object
> > pointer that we were considering is irrelevant.
> >
>
> The problem is not restarting the loop, but avoiding access to an invalid
> memory area.

Yes and could you please explain clearly what the problem is?

>
> > > This other cpu can free the object and unmap page right after you did
> > > the probe_kernel_address(object) (successfully), and before your cpu :
> > >
> > > p = get_freepointer(s, object); << BUG >>
> >
> > If the other cpu frees the object and unmaps the page then
> > get_freepointer_safe() can obtain an arbitrary value since the TID was
> > incremented. We will restart the loop and discard the value retrieved.
> >
>
>
>
> In current code I see :
>
> tid = c->tid;
> barrier();
> object = c->freelist;
>
> There is no guarantee c->tid is fetched before c->freelist by cpu.
>
> You need rmb() here.

Nope. This is not processor to processor concurrency. this_cpu operations
only deal with concurrency issues on the same processor. I.e. interrupts
and preemption.

> I claim all this would be far more simple disabling IRQ before fetching
> c->tid and c->freelist, in DEBUG_PAGE_ALLOC case.
>
> You would not even need to use magic probe_kernel_read()
>
>
> Why do you try so _hard_ trying to optimize this, I really wonder.
> Nobody is able to read this code anymore and prove it's correct.

Optimizing? You think about this as a concurrency issue between multiple
cpus. That is fundamentally wrong. This is dealing with access to per cpu
data and the concurrency issues are only with code running on the *same*
cpu.
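
A minimal user-space sketch of the retry scheme described in this thread, for readers following the argument. It is not the SLUB code: the kernel compares the per-cpu freelist pointer and tid with a double-word cmpxchg, whereas this sketch packs a freelist index and a tid into one 64-bit word and uses C11 atomics. The names cpu_slab, next_free, slab_init and slab_alloc are hypothetical. The point it illustrates is that the next-object value is read speculatively and is discarded whenever the tid has changed. It does not model the DEBUG_PAGEALLOC concern, where the speculative read itself may touch an unmapped page; here the array is always mapped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NOBJ 4
#define NIL  0xffffffffu

/* Hypothetical stand-in for the per-cpu slab state: the freelist head
 * (an index here, not a pointer) and a transaction id share one 64-bit
 * word so that both are compared and replaced together. */
struct cpu_slab {
	_Atomic uint64_t head_tid;
};

/* Per-object "free pointer": index of the next free object. */
static uint32_t next_free[NOBJ];

static uint64_t pack(uint32_t head, uint32_t tid)
{
	return ((uint64_t)tid << 32) | head;
}

static void slab_init(struct cpu_slab *c)
{
	for (uint32_t i = 0; i < NOBJ; i++)
		next_free[i] = (i + 1 < NOBJ) ? i + 1 : NIL;
	atomic_store(&c->head_tid, pack(0, 0));
}

/* Returns an object index, or NIL when the freelist is empty. */
static uint32_t slab_alloc(struct cpu_slab *c)
{
	uint64_t old, new;
	uint32_t object, tid, next;

	do {
		old = atomic_load(&c->head_tid);
		object = (uint32_t)old;
		tid = (uint32_t)(old >> 32);
		if (object == NIL)
			return NIL;
		assert(object < NOBJ);
		/* Speculative read: stale if something ran in between. */
		next = next_free[object];
		new = pack(next, tid + 1);
		/* Commit only if head and tid are both unchanged;
		 * otherwise 'next' is thrown away and we retry. */
	} while (!atomic_compare_exchange_weak(&c->head_tid, &old, new));

	return object;
}
```

Every successful allocation bumps the tid, so an interleaved allocation on the same word makes the compare-and-exchange fail and the stale next value is never installed.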



* Re: [PATCH 3/3] RFC gianfar: add rx_ntuple feature
From: Ben Hutchings @ 2011-05-10 19:38 UTC
  To: Sebastian.Poehn; +Cc: netdev
In-Reply-To: <OF56049997.B9542799-ON8525788C.00470640-8525788C.00470642@BeldenCDT.com>

As a general warning, you may find that the RX NFC interface makes more
sense.  So far only ixgbe and sfc implement the RX n-tuple interface and
ixgbe will be moving to RX NFC.

I don't know quite what the capabilities of this hardware are, so it may
be that RX NFC doesn't make much sense.

On Tue, 2011-05-10 at 08:55 -0400, Sebastian.Poehn@Belden.com wrote:
> This is the main part. Functionality to add and remove ntuples,
> conversion from ntuple to hardware binary rx filer format,
> optimization of hardware filer table entries and extended hardware
> capability check.
> 
> --- gianfar_ethtool.c.orig	2011-05-10 11:45:33.301745000 +0200
> +++ gianfar_ethtool.c	2011-05-10 13:27:23.041744819 +0200

Diffs should be made from above the linux-2.6 directory (or using 'git
diff' or similar).

> @@ -42,6 +42,8 @@
>  
>  extern void gfar_start(struct net_device *dev);
>  extern int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit);
> +extern void sort(void *, size_t, size_t, int(*cmp_func)(const void *,
> +		const void *), void(*swap_func)(void *, void *, int size));

Why are you declaring this here rather than including <linux/sort.h>?

>  #define GFAR_MAX_COAL_USECS 0xffff
>  #define GFAR_MAX_COAL_FRAMES 0xff
> @@ -787,6 +789,1011 @@ static int gfar_set_nfc(struct net_devic
>  	return ret;
>  }
>  
> +/*Global pointer on table*/
> +struct filer_table *ref;
> +u32 filer_index;
> +struct interf *queue;
> +
> +enum nop {
> +	ASC = 0, DESC = 1
> +} row;

Is this global state really necessary?  I think not.

> +static inline void toggle_order(void)
> +{
> +	row ^= 1;
> +}
> +
> +static int my_comp(const void *a, const void *b)
> +{
> +
> +	signed int temp;
> +	if (*(u32 *) a > *(u32 *) b)
> +		temp = -1;
> +	else if (*(u32 *) a == *(u32 *) b)
> +		temp = 0;
> +	else
> +		temp = 1;
> +
> +	if (row == DESC)
> +		return temp;
> +	else
> +		return -temp;
> +}

Use a second comparison function to reverse the order.
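
One way to follow this suggestion, sketched in user space with qsort() standing in for the kernel's sort(). The 16-byte record layout is inferred from my_swap() in the patch; struct entry and the function names are hypothetical. With two comparison functions the sort direction is chosen by the caller, so the global 'row' flag and toggle_order() are not needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-byte table row, sorted by its first word. */
struct entry {
	uint32_t key;
	uint32_t payload[3];
};

static int cmp_key_asc(const void *a, const void *b)
{
	const struct entry *x = a;
	const struct entry *y = b;

	/* Yields -1, 0 or 1 without risking overflow from subtraction. */
	return (x->key > y->key) - (x->key < y->key);
}

/* Reversed order: same comparison with the arguments swapped. */
static int cmp_key_desc(const void *a, const void *b)
{
	return cmp_key_asc(b, a);
}

static void sort_entries(struct entry *v, size_t n, int descending)
{
	assert(v != NULL);
	qsort(v, n, sizeof(*v), descending ? cmp_key_desc : cmp_key_asc);
}
```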

> +static void my_swap(void *a, void *b, int size)
> +{
> +	u32 t1 = *(u32 *) a;
> +	u32 t2 = *(u32 *) (a + 4);
> +	u32 t3 = *(u32 *) (a + 8);
> +	u32 t4 = *(u32 *) (a + 12);
> +	*(u32 *) a = *(u32 *) b;
> +	*(u32 *) (a + 4) = *(u32 *) (b + 4);
> +	*(u32 *) (a + 8) = *(u32 *) (b + 8);
> +	*(u32 *) (a + 12) = *(u32 *) (b + 12);
> +	*(u32 *) b = t1;
> +	*(u32 *) (b + 4) = t2;
> +	*(u32 *) (b + 8) = t3;
> +	*(u32 *) (b + 12) = t4;
> +}
> +
> +/*Write a mask to hardware*/
> +static inline void set_mask(u32 mask)
> +{
> +	ref->fe[filer_index].ctrl = RQFCR_AND | RQFCR_PID_MASK
> +			| RQFCR_CMP_EXACT;
> +	ref->fe[filer_index].prop = mask;
> +	filer_index++;
> +}
> +
> +/*Sets parse bits (e.g. IP or TCP)*/
> +static void set_parse_bits(u32 host, u32 mask)
> +{
> +	set_mask(mask);
> +	ref->fe[filer_index].ctrl = RQFCR_CMP_EXACT | RQFCR_PID_PARSE
> +			| RQFCR_AND;
> +	ref->fe[filer_index].prop = host;
> +	filer_index++;
> +}
> +
> +/*For setting a tuple of host,mask of type flag
> + *Example:
> + *IP-Src = 10.0.0.0/255.0.0.0
> + *host: 0x0A000000 mask: FF000000 flag: RQFPR_IPV4
> + *Note:
> + *For better usage of hardware 16 and 8 bit masks should be filled up
> + *with ones*/
> +static void set_attribute(unsigned int host, unsigned int mask,
> +		unsigned int flag)

It would be clearer to rename 'host' as 'value'.

> +{
> +	if (host || ~mask) {

If all bits are masked then the 'host' value must be ignored.  So just
check ~mask.

> +		/*This is to deal with masks smaller than 32bit
> +		 * and for special processing of MAC-filtering and
> +		 * VLAN-filtering*/
> +		switch (flag) {
> +		/*3bit*/
> +		case RQFCR_PID_PRI:
> +			if (((host & 0x7) == 0) && ((mask & 0x7) == 0))
> +				return;

Doesn't this mean that an n-tuple filter that should match priority 0
will actually match all priority values?

> +			host &= 0x7;
> +			break;
> +			/*8bit*/
> +		case RQFCR_PID_L4P:
> +		case RQFCR_PID_TOS:
> +			if (!(mask & 0xFF))
> +				mask = 0xFFFFFFFF;

I don't understand this special case.  Are you sure you shouldn't be
using something like:

			mask ^= 0xff;

> +			break;
> +			/*12bit*/
> +		case RQFCR_PID_VID:
> +			if (((host & 0xFFF) == 0) && ((mask & 0xFFF) == 0))
> +				return;

Again, this seems to mean that a filter that should match VID 0 (i.e.
untagged) will match both tagged and untagged frames.
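
The comments on set_attribute() depend on what the mask means. A sketch, assuming that a set mask bit means "ignore this bit" (which is how the review reads the patch) and using hypothetical helper names. It contrasts a width-aware wildcard test with the early return the patch uses for the 3-bit priority field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 'width' bits set, e.g. 3 -> 0x7, 12 -> 0xFFF. */
static uint32_t field_bits(unsigned int width)
{
	assert(width >= 1 && width <= 32);
	return width == 32 ? UINT32_MAX : (UINT32_C(1) << width) - 1;
}

/* True when every bit of the field is masked out (nothing to filter). */
static bool field_is_wildcard(uint32_t mask, unsigned int width)
{
	uint32_t bits = field_bits(width);

	return (mask & bits) == bits;
}

/* True when pkt equals value on every bit the mask does not ignore. */
static bool field_matches(uint32_t pkt, uint32_t value, uint32_t mask,
			  unsigned int width)
{
	return ((pkt ^ value) & ~mask & field_bits(width)) == 0;
}

/* The early-return condition from the patch for RQFCR_PID_PRI. */
static bool patch_skips_pri(uint32_t value, uint32_t mask)
{
	return (value & 0x7) == 0 && (mask & 0x7) == 0;
}
```

Under that assumption, value 0 with mask 0 is an exact match on priority 0, yet the patch's condition returns early and installs no rule, which is the behaviour the review questions.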

> +			host &= 0xFFF;
> +			break;
> +			/*16bit*/
> +		case RQFCR_PID_DPT:
> +		case RQFCR_PID_SPT:
> +		case RQFCR_PID_ETY:
> +			if (!(mask & 0xFFFF))
> +				mask = 0xFFFFFFFF;

Again, I don't understand this special case.

> +			break;
> +			/*24bit*/
> +		case RQFCR_PID_DAH:
> +		case RQFCR_PID_DAL:
> +		case RQFCR_PID_SAH:
> +		case RQFCR_PID_SAL:
> +			host &= 0x00FFFFFF;
> +			break;
> +			/*for all real 32bit masks*/
> +		default:
> +			if (!mask)
> +				mask = 0xFFFFFFFF;
> +			break;
> +		}
> +
> +		set_mask(mask);
> +		ref->fe[filer_index].ctrl = RQFCR_CMP_EXACT | RQFCR_AND | flag;
> +		ref->fe[filer_index].prop = host;
> +		filer_index++;
> +	}
> +}
> +
> +/*Translates host and mask for UDP,TCP or SCTP*/
> +static void set_basic_ip(struct ethtool_tcpip4_spec *host,
> +		struct ethtool_tcpip4_spec *mask)
> +{
> +	set_attribute(host->ip4src, mask->ip4src, RQFCR_PID_SIA);
> +	set_attribute(host->ip4dst, mask->ip4dst, RQFCR_PID_DIA);
> +	set_attribute(host->pdst, mask->pdst | 0xFFFF0000, RQFCR_PID_DPT);
> +	set_attribute(host->psrc, mask->psrc | 0xFFFF0000, RQFCR_PID_SPT);
> +	set_attribute(host->tos, mask->tos | 0xFFFFFF00, RQFCR_PID_TOS);
> +}
> +
> +/*Translates host and mask for USER-IP4*/
> +static inline void set_user_ip(struct ethtool_usrip4_spec *host,
> +		struct ethtool_usrip4_spec *mask)
> +{
> +
> +	set_attribute(host->ip4src, mask->ip4src, RQFCR_PID_SIA);
> +	set_attribute(host->ip4dst, mask->ip4dst, RQFCR_PID_DIA);
> +	set_attribute(host->tos, mask->tos | 0xFFFFFF00, RQFCR_PID_TOS);
> +	set_attribute(host->proto, mask->proto | 0xFFFFFF00, RQFCR_PID_L4P);
> +	set_attribute(host->l4_4_bytes, mask->l4_4_bytes, RQFCR_PID_ARB);
> +
> +}
> +
> +/*Translates host and mask for ETHER spec*/
> +static inline void set_ether(struct ethhdr *host, struct ethhdr *mask)
> +{
> +	u32 upper_temp_mask = 0;
> +	u32 lower_temp_mask = 0;
> +	/*Source address*/
> +	if (!(is_zero_ether_addr(host->h_source) && is_broadcast_ether_addr(
> +			mask->h_source))) {

Just check !is_broadcast_ether_addr(mask->h_source).

> +		if (is_zero_ether_addr(mask->h_source)) {
> +			upper_temp_mask = 0xFFFFFFFF;
> +			lower_temp_mask = 0xFFFFFFFF;
> +		} else {
> +			upper_temp_mask = mask->h_source[0] << 16
> +					| mask->h_source[1] << 8
> +					| mask->h_source[2] | 0xFF000000;
> +			lower_temp_mask = mask->h_source[3] << 16
> +					| mask->h_source[4] << 8
> +					| mask->h_source[5] | 0xFF000000;
> +		}
> +		/*Upper 24bit*/
> +		set_attribute(0x80000000 | host->h_source[0] << 16
> +				| host->h_source[1] << 8 | host->h_source[2],
> +				upper_temp_mask, RQFCR_PID_SAH);
> +		/*And the same for the lower part*/
> +		set_attribute(0x80000000 | host->h_source[3] << 16
> +				| host->h_source[4] << 8 | host->h_source[5],
> +				lower_temp_mask, RQFCR_PID_SAL);
> +	}
> +	/*Destination address*/
> +	if (!(is_zero_ether_addr(host->h_dest) && is_broadcast_ether_addr(
> +			mask->h_dest))) {

Similarly here, just test the mask.

> +		/*Special for destination is limited broadcast*/
> +		if ((is_broadcast_ether_addr(host->h_dest)
> +				&& is_zero_ether_addr(mask->h_dest))) {
> +			set_parse_bits(RQFPR_EBC, RQFPR_EBC);
> +		} else {
> +
> +			if (is_zero_ether_addr(mask->h_dest)) {
> +				upper_temp_mask = 0xFFFFFFFF;
> +				lower_temp_mask = 0xFFFFFFFF;
> +			} else {
> +				upper_temp_mask = mask->h_dest[0] << 16
> +						| mask->h_dest[1] << 8
> +						| mask->h_dest[2] | 0xFF000000;
> +				lower_temp_mask = mask->h_dest[3] << 16
> +						| mask->h_dest[4] << 8
> +						| mask->h_dest[5] | 0xFF000000;
> +			}
> +
> +			/*Upper 24bit*/
> +			set_attribute(0x80000000 | host->h_dest[0] << 16
> +					| host->h_dest[1] << 8
> +					| host->h_dest[2], upper_temp_mask,
> +					RQFCR_PID_DAH);
> +			/*And the same for the lower part*/
> +			set_attribute(0x80000000 | host->h_dest[3] << 16
> +					| host->h_dest[4] << 8
> +					| host->h_dest[5], lower_temp_mask,
> +					RQFCR_PID_DAL);
> +		}
> +	}
> +
> +	/*Set Ethertype*/
> +	if ((host->h_proto || ~(mask->h_proto | 0xFFFF0000))) {

Similarly here, just test the mask.

> +		set_attribute(host->h_proto, mask->h_proto | 0xFFFF0000,
> +				RQFCR_PID_ETY);
> +	}
> +
> +	/*
> +	 * Question: What the hell does the 0x80000000 do?
> +	 * Answer: It is just a dirty hack to prevent the setAtribute()
> +	 * to ignore a half MAC address which is like 0x000000/0xFFFFFF
> +	 */

Why would it do that?

Is a filter that matches only upper or only lower 24 bits of a MAC
address invalid?

[Skipped more stuff; I haven't got time to review all of this.]

[...]
> +static int gfar_set_rx_ntuple(struct net_device *dev,
> +		struct ethtool_rx_ntuple *cmd)
> +{	struct gfar __iomem *regs = NULL;
> +	struct gfar_private *priv = netdev_priv(dev);
> +	int i = 0;
> +	static struct interf *store[10];
> +
> +	regs = priv->gfargrp[0].regs;
> +
> +	/*Only values between -2 and num_rx_queues -1 allowed*/
> +	if ((cmd->fs.action >= (signed int)priv->num_rx_queues) ||
> +	(cmd->fs.action < ETHTOOL_RXNTUPLE_ACTION_CLEAR))
> +		return -EINVAL;
> +
> +	for (i = 0; i < 10; i++) {
> +		if (store[i] == 0) {
> +			store[i] = init_table(priv);
> +			if (store[i] == (struct interf *)-1) {
> +				store[i] = 0;
> +				return -1;
> +			}
> +			strcpy(store[i]->name, dev->name);
> +			break;
> +		} else if (!strcmp(store[i]->name, dev->name)) {
> +			queue = store[i];
> +			break;
> +		}
> +
> +	}

Why aren't you putting this state in struct gfar_private?

You can't use name as a key anyway; interfaces can be renamed.
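
A sketch of the alternative, in user-space C with hypothetical types (dev_private stands in for struct gfar_private, filer_state for struct interf). The state hangs off the device's private data, so there is no name lookup, no fixed limit of 10 devices, and a rename cannot detach a device from its table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-device filter table state. */
struct filer_state {
	unsigned int index;
};

/* Hypothetical private data; each device owns at most one table. */
struct dev_private {
	struct filer_state *filer;
};

/* Allocates the table on first use; returns NULL if allocation fails. */
static struct filer_state *get_filer_state(struct dev_private *priv)
{
	assert(priv != NULL);
	if (!priv->filer)
		priv->filer = calloc(1, sizeof(*priv->filer));
	return priv->filer;
}

/* Releases the table, e.g. when the device goes away. */
static void put_filer_state(struct dev_private *priv)
{
	assert(priv != NULL);
	free(priv->filer);
	priv->filer = NULL;
}
```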

> +	do_action(&cmd->fs, priv);
> +
> +	return 0;
> +}
> +
> +
>  const struct ethtool_ops gfar_ethtool_ops = {
>  	.get_settings = gfar_gsettings,
>  	.set_settings = gfar_ssettings,
> @@ -808,4 +1815,6 @@ const struct ethtool_ops gfar_ethtool_op
>  	.set_wol = gfar_set_wol,
>  #endif
>  	.set_rxnfc = gfar_set_nfc,
> +	/*function for accessing rx queue filer*/
> +	.set_rx_ntuple = gfar_set_rx_ntuple
>  };
>  
> Signed-off-by: Sebastian Poehn <sebastian.poehn@belden.com>

This belongs at the top, but is not important for an RFC anyway.

> DISCLAIMER:
> 
> Privileged and/or Confidential information may be contained in this
> message. If you are not the addressee of this message, you may not
> copy, use or deliver this message to anyone. In such event, you
> should destroy the message and kindly notify the sender by reply
> e-mail. It is understood that opinions or conclusions that do not
> relate to the official business of the company are neither given
> nor endorsed by the company.

Well this wasn't sent specifically to me, so am I in trouble now?
Please get rid of this nonsense.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Eric Dumazet @ 2011-05-10 19:32 UTC
  To: Christoph Lameter
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101323290.4023@router.home>

On Tuesday, 10 May 2011 at 13:28 -0500, Christoph Lameter wrote:
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > > +	else
> > > +		p = get_freepointer(s, object);
> > > +
> > > +	local_irq_restore(flags);
> > > +#else
> > > +	p = get_freepointer(s, object);
> > > +#endif
> > > +	return p;
> > > +}
> > > +
> > >  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
> > >  {
> > >  	*(void **)(object + s->offset) = fp;
> > > @@ -1933,7 +1954,7 @@ redo:
> > >  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
> > >  				s->cpu_slab->freelist, s->cpu_slab->tid,
> > >  				object, tid,
> > > -				get_freepointer(s, object), next_tid(tid)))) {
> > > +				get_freepointer_safe(s, object), next_tid(tid)))) {
> > >
> > >  			note_cmpxchg_failure("slab_alloc", s, tid);
> > >  			goto redo;
> >
> >
> > Really this won't work Stephen
> 
> I am not Stephen.
> 

Yes, sorry Christoph.

> > You have to disable IRQ _before_ even fetching 'object'
> 
> The object pointer is being obtained from a per cpu structure and
> not from the page. What is the problem with fetching the object pointer?
> 
> > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> 
> If that occurs then TID is being incremented and we will restart the loop
> getting the new object pointer from the per cpu structure. The object
> pointer that we were considering is irrelevant.
> 

The problem is not restarting the loop, but avoiding access to an invalid
memory area.

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
> 
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.
> 



In current code I see :

tid = c->tid;
barrier();
object = c->freelist;

There is no guarantee c->tid is fetched before c->freelist by cpu.

You need rmb() here.


I claim all this would be far more simple disabling IRQ before fetching
c->tid and c->freelist, in DEBUG_PAGE_ALLOC case.

You would not even need to use magic probe_kernel_read()


Why do you try so _hard_ trying to optimize this, I really wonder.
Nobody is able to read this code anymore and prove it's correct.





* Re: [PATCH] xfrm: Don't allow esn with disabled anti replay detection
From: David Miller @ 2011-05-10 19:28 UTC
  To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <20110510054305.GC8013@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 10 May 2011 07:43:05 +0200

> Unlike the standard case, disabled anti replay detection needs some
> nontrivial extra treatment on ESN. RFC 4303 states:
> 
> Note: If a receiver chooses to not enable anti-replay for an SA, then
> the receiver SHOULD NOT negotiate ESN in an SA management protocol.
> Use of ESN creates a need for the receiver to manage the anti-replay
> window (in order to determine the correct value for the high-order
> bits of the ESN, which are employed in the ICV computation), which is
> generally contrary to the notion of disabling anti-replay for an SA.
> 
> So return an error if an ESN state with disabled anti replay detection
> is inserted for now and add the extra treatment later if we need it.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Also applied, thanks for fixing these bugs!


* Re: [PATCH] xfrm: Assign the inner mode output function to the dst entry
From: David Miller @ 2011-05-10 19:28 UTC
  To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <20110510053638.GB8013@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 10 May 2011 07:36:38 +0200

> As it is, we assign the outer modes output function to the dst entry
> when we create the xfrm bundle. This leads to two problems on interfamily
> scenarios. We might insert ipv4 packets into ip6_fragment when called
> from xfrm6_output. The system crashes if we try to fragment an ipv4
> packet with ip6_fragment. This issue was introduced with git commit
> ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets
> as needed). The second issue is, that we might insert ipv4 packets in
> netfilter6 and vice versa on interfamily scenarios.
> 
> With this patch we assign the inner mode output function to the dst entry
> when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner
> mode is used and the right fragmentation and netfilter functions are called.
> We switch then to outer mode with the output_finish functions.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied.


* Re: oops during unregister_netdevice interface enslaved to bond - regression
From: David Miller @ 2011-05-10 19:26 UTC
  To: eric.dumazet; +Cc: blaschka, netdev, ELELUECK, opurdila
In-Reply-To: <1305034619.2614.37.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 10 May 2011 15:36:59 +0200

> [PATCH net-2.6] net: dev_close() should check IFF_UP

Applied, thanks Eric.


* Re: oops during unregister_netdevice interface enslaved to bond - regression
From: David Miller @ 2011-05-10 19:25 UTC
  To: ELELUECK; +Cc: netdev, Frank.Blaschka
In-Reply-To: <OF0F4919C3.B9CFCAE8-ONC125788C.002D5EB1-C125788C.002D83B7@de.ibm.com>

From: Einar EL Lueck <ELELUECK@de.ibm.com>
Date: Tue, 10 May 2011 10:17:09 +0200

> Calls to the *_many functions introduced by Octavian may never interleave
> because the traversed lists modify each other. This was the root cause of the
> symptom that Frank discovered. Octavian is not a valid mail recipient
> anymore and has not responded from any new mail address. I suggest reverting
> the commit.

I don't think a pure revert is appropriate in this case; the regression
it would introduce is almost as serious as the OOPS here.

Someone just needs to work on a fix.


* Re: [PATCH net-2.6] vlan: fix GVRP at dismantle time
From: David Miller @ 2011-05-10 19:23 UTC
  To: eric.dumazet; +Cc: mirqus, alex, netdev, jesse, greearb, kaber
In-Reply-To: <1305009636.3050.60.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 10 May 2011 08:40:36 +0200

> [PATCH net-2.6] vlan: fix GVRP at dismantle time

Applied and queued up for -stable, thanks!


* Re: [PATCH v2] net: ipv4: add IPPROTO_ICMP socket kind
From: David Miller @ 2011-05-10 19:15 UTC
  To: segoon
  Cc: solar, linux-kernel, netdev, peak, kees.cook, dan.j.rosenberg,
	eugene, nelhage, kuznet, pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20110510180957.GA3262@albatros>

From: Vasiliy Kulikov <segoon@openwall.com>
Date: Tue, 10 May 2011 22:09:59 +0400

In net-next-2.6 we're trying to get rid of uses of route identity
information, and also the types used for flow lookup keys are
completely different.  This code won't compile as-is.

> +	{
> +		struct flowi fl = { .oif = ipc.oif,

This should be "struct flowi4 fl4", declare it at the top
level of the function so you can get at the fully resolved
key values later in this function.

Then use "flowi4_init_output(...)" to initialize the flow instead of
this explicit assignment.

> +	if (!ipc.addr)
> +		ipc.addr = rt->rt_dst;

Replace rt->rt_dst with fl4.daddr.

> +	err = ip_append_data(sk, ping_getfrag, &pfh, len,
> +			0, &ipc, &rt,
> +			msg->msg_flags);

ip_append_data() now takes a flowi4 key pointer as an argument, so
you'll need to pass "&fl4" in.

A lot has changed in this area, your code won't even compile, so
please adjust your patch to fit net-next-2.6 as needed, perhaps
using net/ipv4/raw.c and net/ipv4/udp.c as a guide.

Thanks.


* RE: Kernel 2.6.38.6 page allocation failure (ixgbe)
From: Brandeburg, Jesse @ 2011-05-10 19:06 UTC
  To: Stefan Majer; +Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org
In-Reply-To: <BANLkTik=FM5LJs8JUKHR2S+r41vi94Z7pw@mail.gmail.com>

Adding e1000-devel, our list for the out-of-tree ixgbe driver (the issue is reported below to be in both upstream and out-of-tree).

Do you have jumbo frames enabled?
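
The jumbo-frame question matters because of the allocation size. The report below shows order:2 failures, that is, requests for four physically contiguous pages. A rough sketch of the arithmetic, assuming 4 KiB pages (typical for x86_64) and using hypothetical helper names; the 9000-byte figure is a common jumbo MTU, not a value taken from the report.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by a page allocation of the given order. */
static size_t bytes_for_order(unsigned int order, size_t page_size)
{
	return page_size << order;
}

/* Smallest order whose block is at least 'size' bytes. */
static unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size != 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With 4 KiB pages a 1500-byte buffer fits in a single page, while a buffer of about 9000 bytes needs an order-2 block, which is likely harder to satisfy from interrupt context once memory is fragmented.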

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Stefan Majer
Sent: Tuesday, May 10, 2011 9:03 AM
To: netdev@vger.kernel.org
Subject: Kernel 2.6.38.6 page allocation failure (ixgbe)

Hi,

I'm running 4 nodes with ceph on top of btrfs with a dual-port Intel
X520 10Gb Ethernet card and the latest 3.3.9 ixgbe driver.
During benchmarks I get the following stack.
I can easily reproduce this by simply running rados bench from a fast
machine using these 4 nodes as the ceph cluster.
We saw this with the stock ixgbe driver from 2.6.38.6 and with the latest
3.3.9 ixgbe.
This kernel is tainted because we use fusion-io iodrives as journal
devices for btrfs.

Any hints to nail this down are welcome.

Greetings Stefan Majer

May 10 15:26:40 os02 kernel: [ 3652.485219] cosd: page allocation
failure. order:2, mode:0x4020
May 10 15:26:40 os02 kernel: [ 3652.485223] kswapd0: page allocation
failure. order:2, mode:0x4020
May 10 15:26:40 os02 kernel: [ 3652.485228] Pid: 57, comm: kswapd0
Tainted: P        W   2.6.38.6-1.fits.1.el6.x86_64 #1
May 10 15:26:40 os02 kernel: [ 3652.485230] Call Trace:
May 10 15:26:40 os02 kernel: [ 3652.485232]  <IRQ>
[<ffffffff81108ce7>] ? __alloc_pages_nodemask+0x6f7/0x8a0
May 10 15:26:40 os02 kernel: [ 3652.485247]  [<ffffffff814b0ad0>] ?
ip_local_deliver+0x80/0x90
May 10 15:26:40 os02 kernel: [ 3652.485250] cosd: page allocation
failure. order:2, mode:0x4020
May 10 15:26:40 os02 kernel: [ 3652.485256]  [<ffffffff81146cd2>] ?
kmalloc_large_node+0x62/0xb0
May 10 15:26:40 os02 kernel: [ 3652.485259] Pid: 1849, comm: cosd
Tainted: P        W   2.6.38.6-1.fits.1.el6.x86_64 #1
May 10 15:26:40 os02 kernel: [ 3652.485261] Call Trace:
May 10 15:26:40 os02 kernel: [ 3652.485264]  [<ffffffff8114becb>] ?
__kmalloc_node_track_caller+0x15b/0x1d0
May 10 15:26:40 os02 kernel: [ 3652.485266]  <IRQ>
[<ffffffff81466f74>] ? __netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.485274]  [<ffffffff81108ce7>] ?
__alloc_pages_nodemask+0x6f7/0x8a0
May 10 15:26:40 os02 kernel: [ 3652.485277]  [<ffffffff81466713>] ?
__alloc_skb+0x83/0x170
May 10 15:26:40 os02 kernel: [ 3652.485281]  [<ffffffff814b0ad0>] ?
ip_local_deliver+0x80/0x90
May 10 15:26:40 os02 kernel: [ 3652.485283]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.485287]  [<ffffffff81146cd2>] ?
kmalloc_large_node+0x62/0xb0
May 10 15:26:40 os02 kernel: [ 3652.485297]  [<ffffffffa005d9aa>] ?
ixgbe_alloc_rx_buffers+0x9a/0x450 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.485300]  [<ffffffff8114becb>] ?
__kmalloc_node_track_caller+0x15b/0x1d0
May 10 15:26:40 os02 kernel: [ 3652.485305]  [<ffffffff812b79e0>] ?
swiotlb_map_page+0x0/0x110
May 10 15:26:40 os02 kernel: [ 3652.485308]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.485315]  [<ffffffffa0060930>] ?
ixgbe_poll+0x1140/0x1670 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.485318]  [<ffffffff81466713>] ?
__alloc_skb+0x83/0x170
May 10 15:26:40 os02 kernel: [ 3652.485323]  [<ffffffff810f33eb>] ?
perf_pmu_enable+0x2b/0x40
May 10 15:26:40 os02 kernel: [ 3652.485326]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.485330]  [<ffffffff81474eb2>] ?
net_rx_action+0x102/0x2a0
May 10 15:26:40 os02 kernel: [ 3652.485336]  [<ffffffffa005d9aa>] ?
ixgbe_alloc_rx_buffers+0x9a/0x450 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.485341]  [<ffffffff8106b745>] ?
__do_softirq+0xb5/0x210
May 10 15:26:40 os02 kernel: [ 3652.485344]  [<ffffffff81474840>] ?
napi_skb_finish+0x50/0x70
May 10 15:26:40 os02 kernel: [ 3652.485348]  [<ffffffff810c7ca4>] ?
handle_IRQ_event+0x54/0x180
May 10 15:26:40 os02 kernel: [ 3652.485354]  [<ffffffffa0060930>] ?
ixgbe_poll+0x1140/0x1670 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.485357]  [<ffffffff8106b7bd>] ?
__do_softirq+0x12d/0x210
May 10 15:26:40 os02 kernel: [ 3652.485360]  [<ffffffff810f33eb>] ?
perf_pmu_enable+0x2b/0x40
May 10 15:26:40 os02 kernel: [ 3652.485364]  [<ffffffff8100cf3c>] ?
call_softirq+0x1c/0x30
May 10 15:26:40 os02 kernel: [ 3652.485367]  [<ffffffff81474eb2>] ?
net_rx_action+0x102/0x2a0
May 10 15:26:40 os02 kernel: [ 3652.485369]  [<ffffffff8100e975>] ?
do_softirq+0x65/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485372]  [<ffffffff8106b745>] ?
__do_softirq+0xb5/0x210
May 10 15:26:40 os02 kernel: [ 3652.485375]  [<ffffffff8106b605>] ?
irq_exit+0x95/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485379]  [<ffffffff810c7ca4>] ?
handle_IRQ_event+0x54/0x180
May 10 15:26:40 os02 kernel: [ 3652.485383]  [<ffffffff8154a276>] ?
do_IRQ+0x66/0xe0
May 10 15:26:40 os02 kernel: [ 3652.485386]  [<ffffffff8106b7bd>] ?
__do_softirq+0x12d/0x210
May 10 15:26:40 os02 kernel: [ 3652.485389]  [<ffffffff81542a53>] ?
ret_from_intr+0x0/0x15
May 10 15:26:40 os02 kernel: [ 3652.485391]  <EOI>
[<ffffffff8100cf3c>] ? call_softirq+0x1c/0x30
May 10 15:26:40 os02 kernel: [ 3652.485397]  [<ffffffff81110a54>] ?
shrink_inactive_list+0x164/0x460
May 10 15:26:40 os02 kernel: [ 3652.485400]  [<ffffffff8100e975>] ?
do_softirq+0x65/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485404]  [<ffffffff8153facc>] ?
schedule+0x44c/0xa10
May 10 15:26:40 os02 kernel: [ 3652.485407]  [<ffffffff8106b605>] ?
irq_exit+0x95/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485412]  [<ffffffff81109b1a>] ?
determine_dirtyable_memory+0x1a/0x30
May 10 15:26:40 os02 kernel: [ 3652.485416]  [<ffffffff8154a276>] ?
do_IRQ+0x66/0xe0
May 10 15:26:40 os02 kernel: [ 3652.485419]  [<ffffffff81111453>] ?
shrink_zone+0x3d3/0x530
May 10 15:26:40 os02 kernel: [ 3652.485422]  [<ffffffff81542a53>] ?
ret_from_intr+0x0/0x15
May 10 15:26:40 os02 kernel: [ 3652.485423]  <EOI>
[<ffffffff81074a4a>] ? del_timer_sync+0x3a/0x60
May 10 15:26:40 os02 kernel: [ 3652.485430]  [<ffffffff812a774d>] ?
copy_user_generic_string+0x2d/0x40
May 10 15:26:40 os02 kernel: [ 3652.485435]  [<ffffffff811054a5>] ?
zone_watermark_ok_safe+0xb5/0xd0
May 10 15:26:40 os02 kernel: [ 3652.485439]  [<ffffffff810ff351>] ?
iov_iter_copy_from_user_atomic+0x101/0x170
May 10 15:26:40 os02 kernel: [ 3652.485442]  [<ffffffff81112a69>] ?
kswapd+0x889/0xb20
May 10 15:26:40 os02 kernel: [ 3652.485457]  [<ffffffffa026c91d>] ?
btrfs_copy_from_user+0xcd/0x130 [btrfs]
May 10 15:26:40 os02 kernel: [ 3652.485460]  [<ffffffff811121e0>] ?
kswapd+0x0/0xb20
May 10 15:26:40 os02 kernel: [ 3652.485472]  [<ffffffffa026d844>] ?
__btrfs_buffered_write+0x1a4/0x330 [btrfs]
May 10 15:26:40 os02 kernel: [ 3652.485476]  [<ffffffff810862b6>] ?
kthread+0x96/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485479]  [<ffffffff8117151f>] ?
file_update_time+0x5f/0x170
May 10 15:26:40 os02 kernel: [ 3652.485482]  [<ffffffff8100ce44>] ?
kernel_thread_helper+0x4/0x10
May 10 15:26:40 os02 kernel: [ 3652.485493]  [<ffffffffa026dc08>] ?
btrfs_file_aio_write+0x238/0x4e0 [btrfs]
May 10 15:26:40 os02 kernel: [ 3652.485496]  [<ffffffff81086220>] ?
kthread+0x0/0xa0
May 10 15:26:40 os02 kernel: [ 3652.485507]  [<ffffffffa026d9d0>] ?
btrfs_file_aio_write+0x0/0x4e0 [btrfs]
May 10 15:26:40 os02 kernel: [ 3652.485511]  [<ffffffff8100ce40>] ?
kernel_thread_helper+0x0/0x10
May 10 15:26:40 os02 kernel: [ 3652.485515]  [<ffffffff81158ff3>] ?
do_sync_readv_writev+0xd3/0x110
May 10 15:26:40 os02 kernel: [ 3652.485516] Mem-Info:
May 10 15:26:40 os02 kernel: [ 3652.485519]  [<ffffffff81163d42>] ?
path_put+0x22/0x30
May 10 15:26:40 os02 kernel: [ 3652.485521] Node 0 DMA per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485525]  [<ffffffff812584a3>] ?
selinux_file_permission+0xf3/0x150
May 10 15:26:40 os02 kernel: [ 3652.485528] CPU    0: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485530] CPU    1: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485534]  [<ffffffff81251583>] ?
security_file_permission+0x23/0x90
May 10 15:26:40 os02 kernel: [ 3652.485535] CPU    2: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485538] CPU    3: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485542]  [<ffffffff81159f14>] ?
do_readv_writev+0xd4/0x1e0
May 10 15:26:40 os02 kernel: [ 3652.485544] CPU    4: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485547] CPU    5: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485550]  [<ffffffff81540d91>] ?
mutex_lock+0x31/0x60
May 10 15:26:40 os02 kernel: [ 3652.485552] CPU    6: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485554] CPU    7: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485557]  [<ffffffff8115a066>] ?
vfs_writev+0x46/0x60
May 10 15:26:40 os02 kernel: [ 3652.485558] Node 0 DMA32 per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485562]  [<ffffffff8115a1a1>] ?
sys_writev+0x51/0xc0
May 10 15:26:40 os02 kernel: [ 3652.485564] CPU    0: hi:  186, btch:
31 usd: 144
May 10 15:26:40 os02 kernel: [ 3652.485567] CPU    1: hi:  186, btch:
31 usd: 198
May 10 15:26:40 os02 kernel: [ 3652.485571]  [<ffffffff8100c002>] ?
system_call_fastpath+0x16/0x1b
May 10 15:26:40 os02 kernel: [ 3652.485573] CPU    2: hi:  186, btch:
31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.485574] Mem-Info:
May 10 15:26:40 os02 kernel: [ 3652.485576] CPU    3: hi:  186, btch:
31 usd: 171
May 10 15:26:40 os02 kernel: [ 3652.485578] Node 0 CPU    4: hi:  186,
btch:  31 usd: 159
May 10 15:26:40 os02 kernel: [ 3652.485581] DMA per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485582] CPU    5: hi:  186, btch:
31 usd:  69
May 10 15:26:40 os02 kernel: [ 3652.485585] CPU    0: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485587] CPU    6: hi:  186, btch:
31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.485589] CPU    1: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485591] CPU    7: hi:  186, btch:
31 usd: 184
May 10 15:26:40 os02 kernel: [ 3652.485593] CPU    2: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485594] Node 0 CPU    3: hi:    0,
btch:   1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485597] Normal per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485598] CPU    4: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485600] CPU    0: hi:  186, btch:
31 usd: 100
May 10 15:26:40 os02 kernel: [ 3652.485602] CPU    5: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485604] CPU    1: hi:  186, btch:
31 usd:  47
May 10 15:26:40 os02 kernel: [ 3652.485606] CPU    6: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485608] CPU    2: hi:  186, btch:
31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.485610] CPU    7: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.485612] CPU    3: hi:  186, btch:
31 usd: 140
May 10 15:26:40 os02 kernel: [ 3652.485614] Node 0 CPU    4: hi:  186,
btch:  31 usd: 177
May 10 15:26:40 os02 kernel: [ 3652.485617] DMA32 per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485618] CPU    5: hi:  186, btch:
31 usd:  77
May 10 15:26:40 os02 kernel: [ 3652.485621] CPU    0: hi:  186, btch:
31 usd: 144
May 10 15:26:40 os02 kernel: [ 3652.485623] CPU    6: hi:  186, btch:
31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.485625] CPU    1: hi:  186, btch:
31 usd: 198
May 10 15:26:40 os02 kernel: [ 3652.485627] CPU    7: hi:  186, btch:
31 usd:  68
May 10 15:26:40 os02 kernel: [ 3652.485629] CPU    2: hi:  186, btch:
31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.485634] active_anon:255806
inactive_anon:19454 isolated_anon:0
May 10 15:26:40 os02 kernel: [ 3652.485636]  active_file:420093
inactive_file:5180559 isolated_file:0
May 10 15:26:40 os02 kernel: [ 3652.485637]  unevictable:50582
dirty:314034 writeback:8484 unstable:0
May 10 15:26:40 os02 kernel: [ 3652.485639]  free:30074
slab_reclaimable:35739 slab_unreclaimable:13526
May 10 15:26:40 os02 kernel: [ 3652.485641]  mapped:3440 shmem:51
pagetables:1342 bounce:0
May 10 15:26:40 os02 kernel: [ 3652.485643] CPU    3: hi:  186, btch:
31 usd: 171
May 10 15:26:40 os02 kernel: [ 3652.485644] Node 0 CPU    4: hi:  186,
btch:  31 usd: 159
May 10 15:26:40 os02 kernel: [ 3652.485652] DMA free:15852kB min:12kB
low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:15660kB mlocked:0kB dirty:0kB writeback:0kB
mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 10 15:26:40 os02 kernel: [ 3652.485659] CPU    5: hi:  186, btch:
31 usd:  69
May 10 15:26:40 os02 kernel: [ 3652.485661] lowmem_reserve[]:CPU    6:
hi:  186, btch:  31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.485663]  0CPU    7: hi:  186,
btch:  31 usd: 184
May 10 15:26:40 os02 kernel: [ 3652.485665]  2991Node 0  24201Normal per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.485668]  24201CPU    0: hi:  186,
btch:  31 usd: 100
May 10 15:26:40 os02 kernel: [ 3652.485671]
May 10 15:26:40 os02 kernel: [ 3652.485672] CPU    1: hi:  186, btch:
31 usd:  47
May 10 15:26:40 os02 kernel: [ 3652.485674] Node 0 CPU    2: hi:  186,
btch:  31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.485682] DMA32 free:85748kB
min:2460kB low:3072kB high:3688kB active_anon:20480kB
inactive_anon:5268kB active_file:151588kB inactive_file:2645188kB
unevictable:72kB isolated(anon):0kB isolated(file):0kB
present:3063392kB mlocked:0kB dirty:210820kB writeback:0kB
mapped:648kB shmem:0kB slab_reclaimable:28400kB
slab_unreclaimable:2152kB kernel_stack:520kB pagetables:100kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.485690] CPU    3: hi:  186, btch:
31 usd: 140
May 10 15:26:40 os02 kernel: [ 3652.485691] lowmem_reserve[]:CPU    4:
hi:  186, btch:  31 usd: 177
May 10 15:26:40 os02 kernel: [ 3652.485693]  0CPU    5: hi:  186,
btch:  31 usd:  77
May 10 15:26:40 os02 kernel: [ 3652.485696]  0CPU    6: hi:  186,
btch:  31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.485698]  21210CPU    7: hi:  186,
btch:  31 usd:  68
May 10 15:26:40 os02 kernel: [ 3652.485701]  21210active_anon:255806
inactive_anon:19454 isolated_anon:0
May 10 15:26:40 os02 kernel: [ 3652.485705]  active_file:420093
inactive_file:5180559 isolated_file:0
May 10 15:26:40 os02 kernel: [ 3652.485706]  unevictable:50582
dirty:314034 writeback:8484 unstable:0
May 10 15:26:40 os02 kernel: [ 3652.485707]  free:30074
slab_reclaimable:35739 slab_unreclaimable:13526
May 10 15:26:40 os02 kernel: [ 3652.485708]  mapped:3440 shmem:51
pagetables:1342 bounce:0
May 10 15:26:40 os02 kernel: [ 3652.485709]
May 10 15:26:40 os02 kernel: [ 3652.485710] Node 0 Node 0 DMA
free:15852kB min:12kB low:12kB high:16kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15660kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 10 15:26:40 os02 kernel: [ 3652.485724] Normal free:18696kB
min:17440kB low:21800kB high:26160kB active_anon:1002744kB
inactive_anon:72548kB active_file:1528784kB inactive_file:18077048kB
unevictable:202256kB isolated(anon):0kB isolated(file):0kB
present:21719040kB mlocked:0kB dirty:1045316kB writeback:33936kB
mapped:13112kB shmem:204kB slab_reclaimable:114556kB
slab_unreclaimable:51952kB kernel_stack:3768kB pagetables:5268kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.485731]
lowmem_reserve[]:lowmem_reserve[]: 0 0 2991 0 24201 0 24201 0
May 10 15:26:40 os02 kernel: [ 3652.485737]
May 10 15:26:40 os02 kernel: [ 3652.485738] Node 0 Node 0 DMA32
free:85748kB min:2460kB low:3072kB high:3688kB active_anon:20480kB
inactive_anon:5268kB active_file:151588kB inactive_file:2645188kB
unevictable:72kB isolated(anon):0kB isolated(file):0kB
present:3063392kB mlocked:0kB dirty:210820kB writeback:0kB
mapped:648kB shmem:0kB slab_reclaimable:28400kB
slab_unreclaimable:2152kB kernel_stack:520kB pagetables:100kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.485747] DMA:
lowmem_reserve[]:1*4kB  01*8kB  00*16kB  212101*32kB  212101*64kB
May 10 15:26:40 os02 kernel: [ 3652.485754] 1*128kB Node 0 1*256kB
Normal free:18696kB min:17440kB low:21800kB high:26160kB
active_anon:1002744kB inactive_anon:72548kB active_file:1528784kB
inactive_file:18077048kB unevictable:202256kB isolated(anon):0kB
isolated(file):0kB present:21719040kB mlocked:0kB dirty:1045316kB
writeback:33936kB mapped:13112kB shmem:204kB slab_reclaimable:114556kB
slab_unreclaimable:51952kB kernel_stack:3768kB pagetables:5268kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.485764] 0*512kB
lowmem_reserve[]:1*1024kB  01*2048kB  03*4096kB  0= 15852kB
May 10 15:26:40 os02 kernel: [ 3652.485771]  0Node 0
May 10 15:26:40 os02 kernel: [ 3652.485773] DMA32: Node 0 59*4kB DMA:
125*8kB 1*4kB 66*16kB 1*8kB 80*32kB 0*16kB 188*64kB 1*32kB 51*128kB
1*64kB 15*256kB 1*128kB 40*512kB 1*256kB 31*1024kB 0*512kB 1*2048kB
1*1024kB 1*4096kB 1*2048kB = 85620kB
May 10 15:26:40 os02 kernel: [ 3652.485789] 3*4096kB Node 0 = 15852kB
May 10 15:26:40 os02 kernel: [ 3652.485791] Normal: Node 0 3930*4kB
DMA32: 0*8kB 59*4kB 1*16kB 125*8kB 0*32kB 66*16kB 0*64kB 80*32kB
0*128kB 188*64kB 1*256kB 51*128kB 1*512kB 15*256kB 0*1024kB 40*512kB
1*2048kB 31*1024kB 0*4096kB 1*2048kB = 18552kB
May 10 15:26:40 os02 kernel: [ 3652.485807] 1*4096kB 5651289 total
pagecache pages
May 10 15:26:40 os02 kernel: [ 3652.485809] = 85620kB
May 10 15:26:40 os02 kernel: [ 3652.485810] 0 pages in swap cache
May 10 15:26:40 os02 kernel: [ 3652.485811] Node 0 Swap cache stats:
add 0, delete 0, find 0/0
May 10 15:26:40 os02 kernel: [ 3652.485814] Normal: Free swap  = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.485815] 3930*4kB Total swap = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.485817] 0*8kB 1*16kB 0*32kB 0*64kB
0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 18552kB
May 10 15:26:40 os02 kernel: [ 3652.485822] 5651289 total pagecache pages
May 10 15:26:40 os02 kernel: [ 3652.485823] 0 pages in swap cache
May 10 15:26:40 os02 kernel: [ 3652.485824] Swap cache stats: add 0,
delete 0, find 0/0
May 10 15:26:40 os02 kernel: [ 3652.485825] Free swap  = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.485826] Total swap = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.486439] kworker/0:1: page
allocation failure. order:2, mode:0x4020
May 10 15:26:40 os02 kernel: [ 3652.486443] Pid: 0, comm: kworker/0:1
Tainted: P        W   2.6.38.6-1.fits.1.el6.x86_64 #1
May 10 15:26:40 os02 kernel: [ 3652.486446] Call Trace:
May 10 15:26:40 os02 kernel: [ 3652.486448]  <IRQ>
[<ffffffff81108ce7>] ? __alloc_pages_nodemask+0x6f7/0x8a0
May 10 15:26:40 os02 kernel: [ 3652.486459]  [<ffffffff814b0ad0>] ?
ip_local_deliver+0x80/0x90
May 10 15:26:40 os02 kernel: [ 3652.486464]  [<ffffffff81146cd2>] ?
kmalloc_large_node+0x62/0xb0
May 10 15:26:40 os02 kernel: [ 3652.486468]  [<ffffffff8114becb>] ?
__kmalloc_node_track_caller+0x15b/0x1d0
May 10 15:26:40 os02 kernel: [ 3652.486473]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.486476]  [<ffffffff81466713>] ?
__alloc_skb+0x83/0x170
May 10 15:26:40 os02 kernel: [ 3652.486479]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3652.486489]  [<ffffffffa005d9aa>] ?
ixgbe_alloc_rx_buffers+0x9a/0x450 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.486494]  [<ffffffff81474840>] ?
napi_skb_finish+0x50/0x70
May 10 15:26:40 os02 kernel: [ 3652.486501]  [<ffffffffa0060930>] ?
ixgbe_poll+0x1140/0x1670 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3652.486506]  [<ffffffff81013379>] ?
sched_clock+0x9/0x10
May 10 15:26:40 os02 kernel: [ 3652.486510]  [<ffffffff81474eb2>] ?
net_rx_action+0x102/0x2a0
May 10 15:26:40 os02 kernel: [ 3652.486514]  [<ffffffff8106b745>] ?
__do_softirq+0xb5/0x210
May 10 15:26:40 os02 kernel: [ 3652.486520]  [<ffffffff8108aec4>] ?
hrtimer_interrupt+0x134/0x240
May 10 15:26:40 os02 kernel: [ 3652.486523]  [<ffffffff8100cf3c>] ?
call_softirq+0x1c/0x30
May 10 15:26:40 os02 kernel: [ 3652.486526]  [<ffffffff8100e975>] ?
do_softirq+0x65/0xa0
May 10 15:26:40 os02 kernel: [ 3652.486529]  [<ffffffff8106b605>] ?
irq_exit+0x95/0xa0
May 10 15:26:40 os02 kernel: [ 3652.486533]  [<ffffffff8154a360>] ?
smp_apic_timer_interrupt+0x70/0x9b
May 10 15:26:40 os02 kernel: [ 3652.486536]  [<ffffffff8100c9f3>] ?
apic_timer_interrupt+0x13/0x20
May 10 15:26:40 os02 kernel: [ 3652.486538]  <EOI>
[<ffffffff812db311>] ? intel_idle+0xc1/0x120
May 10 15:26:40 os02 kernel: [ 3652.486544]  [<ffffffff812db2f4>] ?
intel_idle+0xa4/0x120
May 10 15:26:40 os02 kernel: [ 3652.486549]  [<ffffffff8143bca5>] ?
cpuidle_idle_call+0xb5/0x240
May 10 15:26:40 os02 kernel: [ 3652.486554]  [<ffffffff8100aa87>] ?
cpu_idle+0xb7/0x110
May 10 15:26:40 os02 kernel: [ 3652.486558]  [<ffffffff81538ffe>] ?
start_secondary+0x21f/0x221
May 10 15:26:40 os02 kernel: [ 3652.486561] Mem-Info:
May 10 15:26:40 os02 kernel: [ 3652.486562] Node 0 DMA per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.486564] CPU    0: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486567] CPU    1: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486569] CPU    2: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486571] CPU    3: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486573] CPU    4: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486575] CPU    5: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486578] CPU    6: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486580] CPU    7: hi:    0, btch:
 1 usd:   0
May 10 15:26:40 os02 kernel: [ 3652.486581] Node 0 DMA32 per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.486584] CPU    0: hi:  186, btch:
31 usd: 144
May 10 15:26:40 os02 kernel: [ 3652.486586] CPU    1: hi:  186, btch:
31 usd: 198
May 10 15:26:40 os02 kernel: [ 3652.486588] CPU    2: hi:  186, btch:
31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.486590] CPU    3: hi:  186, btch:
31 usd: 172
May 10 15:26:40 os02 kernel: [ 3652.486593] CPU    4: hi:  186, btch:
31 usd: 159
May 10 15:26:40 os02 kernel: [ 3652.486595] CPU    5: hi:  186, btch:
31 usd:  69
May 10 15:26:40 os02 kernel: [ 3652.486597] CPU    6: hi:  186, btch:
31 usd: 180
May 10 15:26:40 os02 kernel: [ 3652.486599] CPU    7: hi:  186, btch:
31 usd: 184
May 10 15:26:40 os02 kernel: [ 3652.486601] Node 0 Normal per-cpu:
May 10 15:26:40 os02 kernel: [ 3652.486603] CPU    0: hi:  186, btch:
31 usd: 162
May 10 15:26:40 os02 kernel: [ 3652.486605] CPU    1: hi:  186, btch:
31 usd:  47
May 10 15:26:40 os02 kernel: [ 3652.486608] CPU    2: hi:  186, btch:
31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.486610] CPU    3: hi:  186, btch:
31 usd: 141
May 10 15:26:40 os02 kernel: [ 3652.486612] CPU    4: hi:  186, btch:
31 usd: 177
May 10 15:26:40 os02 kernel: [ 3652.486614] CPU    5: hi:  186, btch:
31 usd:  77
May 10 15:26:40 os02 kernel: [ 3652.486616] CPU    6: hi:  186, btch:
31 usd: 168
May 10 15:26:40 os02 kernel: [ 3652.486618] CPU    7: hi:  186, btch:
31 usd: 174
May 10 15:26:40 os02 kernel: [ 3652.486624] active_anon:255806
inactive_anon:19454 isolated_anon:0
May 10 15:26:40 os02 kernel: [ 3652.486625]  active_file:420093
inactive_file:5180745 isolated_file:0
May 10 15:26:40 os02 kernel: [ 3652.486627]  unevictable:50582
dirty:314470 writeback:8484 unstable:0
May 10 15:26:40 os02 kernel: [ 3652.486628]  free:29795
slab_reclaimable:35739 slab_unreclaimable:13526
May 10 15:26:40 os02 kernel: [ 3652.486629]  mapped:3440 shmem:51
pagetables:1342 bounce:0
May 10 15:26:40 os02 kernel: [ 3652.486631] Node 0 DMA free:15852kB
min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:15660kB mlocked:0kB dirty:0kB writeback:0kB
mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 10 15:26:40 os02 kernel: [ 3652.486642] lowmem_reserve[]: 0 2991 24201 24201
May 10 15:26:40 os02 kernel: [ 3652.486645] Node 0 DMA32 free:85748kB
min:2460kB low:3072kB high:3688kB active_anon:20480kB
inactive_anon:5268kB active_file:151588kB inactive_file:2645188kB
unevictable:72kB isolated(anon):0kB isolated(file):0kB
present:3063392kB mlocked:0kB dirty:210820kB writeback:0kB
mapped:648kB shmem:0kB slab_reclaimable:28400kB
slab_unreclaimable:2152kB kernel_stack:520kB pagetables:100kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.486657] lowmem_reserve[]: 0 0 21210 21210
May 10 15:26:40 os02 kernel: [ 3652.486660] Node 0 Normal free:17580kB
min:17440kB low:21800kB high:26160kB active_anon:1002744kB
inactive_anon:72548kB active_file:1528784kB inactive_file:18077792kB
unevictable:202256kB isolated(anon):0kB isolated(file):0kB
present:21719040kB mlocked:0kB dirty:1047060kB writeback:33936kB
mapped:13112kB shmem:204kB slab_reclaimable:114556kB
slab_unreclaimable:51952kB kernel_stack:3768kB pagetables:5268kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:64
all_unreclaimable? no
May 10 15:26:40 os02 kernel: [ 3652.486673] lowmem_reserve[]: 0 0 0 0
May 10 15:26:40 os02 kernel: [ 3652.486675] Node 0 DMA: 1*4kB 1*8kB
0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB
3*4096kB = 15852kB
May 10 15:26:40 os02 kernel: [ 3652.486684] Node 0 DMA32: 59*4kB
125*8kB 66*16kB 80*32kB 188*64kB 51*128kB 15*256kB 40*512kB 31*1024kB
1*2048kB 1*4096kB = 85620kB
May 10 15:26:40 os02 kernel: [ 3652.486692] Node 0 Normal: 3705*4kB
12*8kB 16*16kB 4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB
0*4096kB = 18180kB
May 10 15:26:40 os02 kernel: [ 3652.486700] 5651289 total pagecache pages
May 10 15:26:40 os02 kernel: [ 3652.486702] 0 pages in swap cache
May 10 15:26:40 os02 kernel: [ 3652.486704] Swap cache stats: add 0,
delete 0, find 0/0
May 10 15:26:40 os02 kernel: [ 3652.486705] Free swap  = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.486707] Total swap = 1048572kB
May 10 15:26:40 os02 kernel: [ 3652.562795] 6291440 pages RAM
May 10 15:26:40 os02 kernel: [ 3652.562798] 108688 pages reserved
May 10 15:26:40 os02 kernel: [ 3652.562799] 5429575 pages shared
May 10 15:26:40 os02 kernel: [ 3652.562801] 783596 pages non-shared
May 10 15:26:40 os02 kernel: [ 3652.651570] 6291440 pages RAM
May 10 15:26:40 os02 kernel: [ 3652.651572] 108688 pages reserved
May 10 15:26:40 os02 kernel: [ 3652.651573] 5430055 pages shared
May 10 15:26:40 os02 kernel: [ 3652.651575] 782974 pages non-shared
May 10 15:26:40 os02 kernel: [ 3652.721553] 6291440 pages RAM
May 10 15:26:40 os02 kernel: [ 3652.721555] 108688 pages reserved
May 10 15:26:40 os02 kernel: [ 3652.721556] 5430961 pages shared
May 10 15:26:40 os02 kernel: [ 3652.721557] 781496 pages non-shared
May 10 15:26:40 os02 kernel: [ 3654.349865] Pid: 1846, comm: cosd
Tainted: P        W   2.6.38.6-1.fits.1.el6.x86_64 #1
May 10 15:26:40 os02 kernel: [ 3654.358792] Call Trace:
May 10 15:26:40 os02 kernel: [ 3654.361519]  <IRQ>
[<ffffffff81108ce7>] ? __alloc_pages_nodemask+0x6f7/0x8a0
May 10 15:26:40 os02 kernel: [ 3654.369495]  [<ffffffff814b0ad0>] ?
ip_local_deliver+0x80/0x90
May 10 15:26:40 os02 kernel: [ 3654.376005]  [<ffffffff81146cd2>] ?
kmalloc_large_node+0x62/0xb0
May 10 15:26:40 os02 kernel: [ 3654.382703]  [<ffffffff8114becb>] ?
__kmalloc_node_track_caller+0x15b/0x1d0
May 10 15:26:40 os02 kernel: [ 3654.390464]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3654.397163]  [<ffffffff81466713>] ?
__alloc_skb+0x83/0x170
May 10 15:26:40 os02 kernel: [ 3654.403277]  [<ffffffff81466f74>] ?
__netdev_alloc_skb+0x24/0x50
May 10 15:26:40 os02 kernel: [ 3654.409970]  [<ffffffffa005d9aa>] ?
ixgbe_alloc_rx_buffers+0x9a/0x450 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3654.417926]  [<ffffffff812b79e0>] ?
swiotlb_map_page+0x0/0x110
May 10 15:26:40 os02 kernel: [ 3654.424432]  [<ffffffffa0060930>] ?
ixgbe_poll+0x1140/0x1670 [ixgbe]
May 10 15:26:40 os02 kernel: [ 3654.431518]  [<ffffffff810f33eb>] ?
perf_pmu_enable+0x2b/0x40
May 10 15:26:40 os02 kernel: [ 3654.437924]  [<ffffffff81474eb2>] ?
net_rx_action+0x102/0x2a0
May 10 15:26:40 os02 kernel: [ 3654.444329]  [<ffffffff8106b745>] ?
__do_softirq+0xb5/0x210
May 10 15:26:40 os02 kernel: [ 3654.450541]  [<ffffffff810c7ca4>] ?
handle_IRQ_event+0x54/0x180
May 10 15:26:40 os02 kernel: [ 3654.457138]  [<ffffffff8106b7bd>] ?
__do_softirq+0x12d/0x210
May 10 15:26:40 os02 kernel: [ 3654.463446]  [<ffffffff8100cf3c>] ?
call_softirq+0x1c/0x30
May 10 15:26:40 os02 kernel: [ 3654.469562]  [<ffffffff8100e975>] ?
do_softirq+0x65/0xa0
May 10 15:26:40 os02 kernel: [ 3654.475484]  [<ffffffff8106b605>] ?
irq_exit+0x95/0xa0
May 10 15:26:40 os02 kernel: [ 3654.481218]  [<ffffffff8154a276>] ?
do_IRQ+0x66/0xe0
May 10 15:26:40 os02 kernel: [ 3654.486754]  [<ffffffff81542a53>] ?
ret_from_intr+0x0/0x15
May 10 15:26:40 os02 kernel: [ 3654.492867]  <EOI>
[<ffffffff81286919>] ? __make_request+0x149/0x4c0
May 10 15:26:40 os02 kernel: [ 3654.500061]  [<ffffffff812868e4>] ?
__make_request+0x114/0x4c0
May 10 15:26:41 os02 kernel: [ 3654.506565]  [<ffffffff812841bd>] ?
generic_make_request+0x2fd/0x5e0
May 10 15:26:41 os02 kernel: [ 3654.513649]  [<ffffffff8142742b>] ?
dm_get_live_table+0x4b/0x60
May 10 15:26:41 os02 kernel: [ 3654.520248]  [<ffffffff81427bc1>] ?
dm_merge_bvec+0xc1/0x140
May 10 15:26:41 os02 kernel: [ 3654.526555]  [<ffffffff81284526>] ?
submit_bio+0x86/0x110
May 10 15:26:41 os02 kernel: [ 3654.532574]  [<ffffffff8118deac>] ?
dio_bio_submit+0xbc/0xc0
May 10 15:26:41 os02 kernel: [ 3654.538881]  [<ffffffff8118df40>] ?
dio_send_cur_page+0x90/0xc0
May 10 15:26:41 os02 kernel: [ 3654.545478]  [<ffffffff8118dfd5>] ?
submit_page_section+0x65/0x180
May 10 15:26:41 os02 kernel: [ 3654.552370]  [<ffffffff8118e918>] ?
__blockdev_direct_IO+0x678/0xb30
May 10 15:26:41 os02 kernel: [ 3654.559454]  [<ffffffff81250eaf>] ?
security_inode_getsecurity+0x1f/0x30
May 10 15:26:41 os02 kernel: [ 3654.566924]  [<ffffffff8118c627>] ?
blkdev_direct_IO+0x57/0x60
May 10 15:26:41 os02 kernel: [ 3654.573414]  [<ffffffff8118b760>] ?
blkdev_get_blocks+0x0/0xc0
May 10 15:26:41 os02 kernel: [ 3654.579954]  [<ffffffff811008f2>] ?
generic_file_direct_write+0xc2/0x190
May 10 15:26:41 os02 kernel: [ 3654.587424]  [<ffffffff811715b6>] ?
file_update_time+0xf6/0x170
May 10 15:26:41 os02 kernel: [ 3654.594025]  [<ffffffff811023eb>] ?
__generic_file_aio_write+0x32b/0x460
May 10 15:26:41 os02 kernel: [ 3654.601494]  [<ffffffff8105c9e0>] ?
wake_up_state+0x10/0x20



and so on.

--
Stefan Majer



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 19:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101323290.4023@router.home>

On Tue, 10 May 2011, Christoph Lameter wrote:

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
>
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.

Ok, I forgot that a different cpu is involved there. A different cpu cannot
unmap the page or free the page since the page is in a frozen state while
we allocate from it.

The page is only handled by the cpu it was assigned to until the cpu which
froze it releases it.

The only case that we need to protect against here is the case when an
interrupt or reschedule causes the *same* cpu to release the page. In that
case the TID must have been incremented.
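
The retry argument above can be sketched in user-space C. This is only an
illustration under stated assumptions: struct cpu_slab, commit_if_unchanged(),
alloc_fastpath() and the interrupt hook are hypothetical names, the
compare-and-commit is a plain-C model of what this_cpu_cmpxchg_double() does in
the kernel, and page unmapping under DEBUG_PAGEALLOC is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu slab state; not the kernel's struct. */
struct cpu_slab {
	void **freelist;	/* first free object; each object stores the next pointer */
	unsigned long tid;	/* bumped on every change made from this cpu */
};

/*
 * Models the effect of a cmpxchg on the (freelist, tid) pair: commit only
 * if both words still hold the values sampled earlier. Plain C suffices
 * because this sketch is single-threaded; the kernel needs a primitive
 * that is atomic against interrupts on the same cpu.
 */
static bool commit_if_unchanged(struct cpu_slab *c,
				void **old_obj, unsigned long old_tid,
				void **new_obj)
{
	if (c->freelist != old_obj || c->tid != old_tid)
		return false;
	c->freelist = new_obj;
	c->tid = old_tid + 1;
	return true;
}

/* Optional hook playing the role of an interrupt between sample and commit. */
typedef void (*irq_hook)(struct cpu_slab *c);

static void *alloc_fastpath(struct cpu_slab *c, irq_hook irq, int *retries)
{
	for (;;) {
		unsigned long tid = c->tid;
		void **object = c->freelist;
		void **next;

		if (!object)
			return NULL;	/* slow path omitted */

		next = (void **)*object;	/* may be stale; discarded on retry */

		if (irq) {
			irq(c);
			irq = NULL;	/* fire only once */
		}

		if (commit_if_unchanged(c, object, tid, next))
			return object;
		if (retries)
			(*retries)++;
	}
}

/* Simulated interrupt: the same cpu takes the first object itself. */
static void irq_takes_object(struct cpu_slab *c)
{
	void **object = c->freelist;

	c->freelist = (void **)*object;
	c->tid++;
}
```

With a three-object freelist and the hook installed, the first pass samples
the old tid, the simulated interrupt takes the first object and bumps the tid,
the commit fails, and the second pass returns the second object.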

^ permalink raw reply

* Re: [PATCH 2/2] net/dl2k: Don't reconfigure link @100Mbps when disabling autoneg @1Gbps
From: Ben Hutchings @ 2011-05-10 18:55 UTC (permalink / raw)
  To: David Decotigny
  Cc: Giuseppe Cavallaro, David S. Miller, Joe Perches,
	Stanislaw Gruszka, netdev, linux-kernel
In-Reply-To: <1304986748-15809-3-git-send-email-decot@google.com>

On Mon, 2011-05-09 at 17:19 -0700, David Decotigny wrote:
> The initial version of the driver used to force the link to 100Mbps
> when auto-negotiation was disabled on a 1Gbps link, ignoring the
> requested link speed. Instead, this change refuses to change anything
> when it is asked to configure the link speed at 1Gbps without
> auto-negotiation, but acts as requested in all the other cases.
> 
> IMPORTANT: Previously, the return value from mii_set_media() was
>            ignored. This patch uses it for its own return value.
> 
> Tested: module compiling, NOT tested on real hardware.
> Signed-off-by: David Decotigny <decot@google.com>
[...]

The changes to validation look fine.  However, I noticed that there's a
call to netif_carrier_off() at the top of this function.  This means
that in the error and shortcut cases, the interface will be left
disabled!  It's an existing bug but might be made slightly worse by this
change.

Please also move the call to netif_carrier_off() down to the end, just
before the call to mii_set_media() which actually alters the link.
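
The ordering being asked for can be sketched as below. This is not the dl2k
code: struct fake_dev and the fake_*() helpers are hypothetical stand-ins, and
the exact shortcut condition is an assumption made for illustration. The point
is only that validation and early returns come first, and the carrier is
dropped immediately before the call that alters the link.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver state and helpers; not dl2k's code. */
struct fake_dev {
	bool carrier_ok;
	int speed;
	bool autoneg;
};

/* Stands in for netif_carrier_off(). */
static void fake_carrier_off(struct fake_dev *dev)
{
	dev->carrier_ok = false;
}

/* Stands in for mii_set_media(): records the requested settings. */
static int fake_set_media(struct fake_dev *dev, int speed, bool autoneg)
{
	dev->speed = speed;
	dev->autoneg = autoneg;
	return 0;
}

static int fake_set_settings(struct fake_dev *dev, int speed, bool autoneg)
{
	/* Validate first: forced 1000 Mb/s is refused, as the patch describes. */
	if (!autoneg && speed == 1000)
		return -EINVAL;

	/* Assumed shortcut: nothing to change, so leave the carrier alone. */
	if (dev->speed == speed && dev->autoneg == autoneg)
		return 0;

	/* Only now, with the link about to change, drop the carrier. */
	fake_carrier_off(dev);
	return fake_set_media(dev, speed, autoneg);
}
```

With this ordering, the error path and the no-change path both return with
the carrier state untouched.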

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] ixgbe: RX ntuple feature must check num_rx_queues
From: Jeff Kirsher @ 2011-05-10 18:52 UTC (permalink / raw)
  To: Sebastian.Poehn; +Cc: netdev
In-Reply-To: <OFA595D9DB.234DFCEB-ON8525788C.004ED485-8525788C.004ED485@BeldenCDT.com>

On Tue, May 10, 2011 at 07:21,  <Sebastian.Poehn@belden.com> wrote:
> The driver must check how many RX queues there are, not TX queues.
>
> --- ixgbe_ethtool.c.orig    2011-05-10 16:18:00.313745560 +0200
> +++ ixgbe_ethtool.c    2011-05-10 16:18:23.285747635 +0200
> @@ -2349,9 +2349,9 @@ static int ixgbe_set_rx_ntuple(struct ne
>
>      /*
>       * Don't allow programming if the action is a queue greater than
> -     * the number of online Tx queues.
> +     * the number of online Rx queues.
>       */
> -    if ((fs->action >= adapter->num_tx_queues) ||
> +    if ((fs->action >= adapter->num_rx_queues) ||
>          (fs->action < ETHTOOL_RXNTUPLE_ACTION_DROP))
>          return -EINVAL;
>
> Signed-off-by: Sebastian Poehn <sebastian.poehn@belden.com>

Thanks Sebastian, I have added the patch to my queue of patches.

-- 
Cheers,
Jeff

^ permalink raw reply

* Re: [PATCHv2 net-next-2.6 2/2] qlcnic: Take FW dump via ethtool
From: Ben Hutchings @ 2011-05-10 18:40 UTC (permalink / raw)
  To: Anirban Chakraborty; +Cc: netdev, davem
In-Reply-To: <1304989352-24810-4-git-send-email-anirban.chakraborty@qlogic.com>

On Mon, 2011-05-09 at 18:02 -0700, Anirban Chakraborty wrote:
> Driver checks if the previous dump has been cleared before taking the dump.
> It doesn't take the dump if it is not cleared.
> 
> Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
> ---
>  drivers/net/qlcnic/qlcnic_ethtool.c |   60 +++++++++++++++++++++++++++++++++++
>  1 files changed, 60 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/qlcnic/qlcnic_ethtool.c b/drivers/net/qlcnic/qlcnic_ethtool.c
> index c541461..1237449 100644
> --- a/drivers/net/qlcnic/qlcnic_ethtool.c
> +++ b/drivers/net/qlcnic/qlcnic_ethtool.c
> @@ -965,6 +965,64 @@ static void qlcnic_set_msglevel(struct net_device *netdev, u32 msglvl)
>  	adapter->msg_enable = msglvl;
>  }
>  
> +static int
> +qlcnic_get_dump(struct net_device *netdev, struct ethtool_dump *dump,
> +		void *buffer)
> +{
> +	int i, copy_sz;
> +	u32 *hdr_ptr, *data;
> +	struct qlcnic_adapter *adapter = netdev_priv(netdev);
> +	struct qlcnic_fw_dump *fw_dump = &adapter->ahw->fw_dump;
> +
> +	if (dump->type == ETHTOOL_DUMP_FLAG) {
> +		dump->len = fw_dump->tmpl_hdr->size + fw_dump->size;
> +		dump->flag = fw_dump->tmpl_hdr->drv_cap_mask;
> +		return 0;
> +	}
> +	if (!fw_dump->clr) {
> +		netdev_info(netdev, "Dump not available\n");
> +		return -EINVAL;
> +	}
> +	copy_sz = fw_dump->tmpl_hdr->size;
> +	/* Copy template header first */
> +	hdr_ptr = (u32 *) fw_dump->tmpl_hdr;
> +	data = (u32 *) buffer;
> +	for (i = 0; i < copy_sz/sizeof(u32); i++)
> +		*data++ = cpu_to_le32(*hdr_ptr++);
> +	/* Copy captured dump data */
> +	memcpy(buffer + copy_sz, fw_dump->data, fw_dump->size);
> +	dump->len = copy_sz + fw_dump->size;
> +	dump->flag = fw_dump->tmpl_hdr->drv_cap_mask;
> +	/* free dump area once the whole dump data has been captured */
> +	vfree(fw_dump->data);
> +	fw_dump->size = 0;
> +	fw_dump->data = NULL;
> +	fw_dump->clr = 0;

This doesn't seem to be serialised with the code that captures firmware
dumps.  They need to use the same lock!
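
A rough user-space model of what "the same lock" means here, assuming a
capture path and a read-out path that share one dump buffer. Everything in it
is hypothetical (struct fake_fw_dump, fake_capture(), fake_get_dump()), and a
pthread mutex merely stands in for whichever lock the driver would actually
use.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space model of the dump state; not qlcnic's structures. */
struct fake_fw_dump {
	pthread_mutex_t lock;	/* shared by capture and read-out */
	unsigned char *data;
	size_t size;
	int clr;		/* nonzero while a captured dump awaits read-out */
};

/* Capture path: refuses to capture while the previous dump is still pending. */
static int fake_capture(struct fake_fw_dump *d, const void *src, size_t len)
{
	int ret = -1;

	pthread_mutex_lock(&d->lock);
	if (!d->clr) {
		d->data = malloc(len);
		if (d->data) {
			memcpy(d->data, src, len);
			d->size = len;
			d->clr = 1;
			ret = 0;
		}
	}
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/* Read-out path: copy and free under the same lock the capture path takes. */
static int fake_get_dump(struct fake_fw_dump *d, void *buf, size_t buflen,
			 size_t *len)
{
	int ret = -1;

	pthread_mutex_lock(&d->lock);
	if (d->clr && d->size <= buflen) {
		memcpy(buf, d->data, d->size);
		*len = d->size;
		free(d->data);
		d->data = NULL;
		d->size = 0;
		d->clr = 0;
		ret = 0;
	}
	pthread_mutex_unlock(&d->lock);
	return ret;
}
```

Because both paths hold the lock across the check of clr and the update of
data and size, a capture cannot observe a half-freed buffer.
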

> +	return 0;
> +}
> +
> +static int
> +qlcnic_set_dump(struct net_device *netdev, struct ethtool_dump *val)
> +{
> +	struct qlcnic_adapter *adapter = netdev_priv(netdev);
> +	struct qlcnic_fw_dump *fw_dump = &adapter->ahw->fw_dump;
> +	if (val->flag == QLCNIC_FORCE_FW_DUMP_KEY) {
> +		netdev_info(netdev, "Forcing a FW dump\n");
> +		qlcnic_dev_request_reset(adapter);
> +	} else {
> +		if (val->flag > QLCNIC_DUMP_MASK_MAX ||
> +			val->flag < QLCNIC_DUMP_MASK_MIN) {
> +				netdev_info(netdev,
> +				"Invalid dump level: 0x%x\n", val->flag);
> +				return -EINVAL;
> +		}
> +		fw_dump->tmpl_hdr->drv_cap_mask = val->flag & 0xff;
> +		netdev_info(netdev, "Driver mask changed to: 0x%x\n",
> +			fw_dump->tmpl_hdr->drv_cap_mask);

If the flags change, doesn't this invalidate any dump that has been
collected by the driver but not saved?

Also, same locking problem here.

Ben.

> +	}
> +	return 0;
> +}
> +
>  const struct ethtool_ops qlcnic_ethtool_ops = {
>  	.get_settings = qlcnic_get_settings,
>  	.set_settings = qlcnic_set_settings,
> @@ -991,4 +1049,6 @@ const struct ethtool_ops qlcnic_ethtool_ops = {
>  	.set_phys_id = qlcnic_set_led,
>  	.set_msglevel = qlcnic_set_msglevel,
>  	.get_msglevel = qlcnic_get_msglevel,
> +	.get_dump = qlcnic_get_dump,
> +	.set_dump = qlcnic_set_dump,
>  };

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCHv2 net-next-2.6] ethtool: Added support for FW dump
From: Ben Hutchings @ 2011-05-10 18:29 UTC (permalink / raw)
  To: Anirban Chakraborty; +Cc: netdev, davem
In-Reply-To: <1304989352-24810-2-git-send-email-anirban.chakraborty@qlogic.com>

On Mon, 2011-05-09 at 18:02 -0700, Anirban Chakraborty wrote:
> Added code to take FW dump via ethtool. A pair of set and get functions are
> added to configure dump level and fetch dump data from the driver respectively.

I don't understand why you are combining get-flags and get-data in the
driver interface.  I suggested that you could use a single option for
these in the ethtool *utility*, but combining them in the driver
interface just seems to complicate the implementation of the ethtool
core code and the driver.
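
To make the suggestion concrete, here is a sketch of a driver interface with
two narrow hooks instead of one hook switched on a type field. All names
(struct fake_dump, struct fake_ops, get_dump_flag, get_dump_data,
fake_core_get_data()) are assumptions for illustration, not the interface
proposed in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dump header; field names loosely follow the patch. */
struct fake_dump {
	uint32_t version;
	uint32_t flag;
	uint32_t len;
};

struct fake_dev;

/* Two narrow driver hooks: one for flag and length, one for the data. */
struct fake_ops {
	int (*get_dump_flag)(struct fake_dev *dev, struct fake_dump *dump);
	int (*get_dump_data)(struct fake_dev *dev, struct fake_dump *dump,
			     void *buffer);
};

struct fake_dev {
	const struct fake_ops *ops;
	uint32_t flag;
	const void *blob;
	uint32_t blob_len;
};

/* Core-side helper: learn the length first, then fetch into the buffer. */
static int fake_core_get_data(struct fake_dev *dev, void *buf, uint32_t buflen,
			      struct fake_dump *dump)
{
	int ret;

	if (!dev->ops->get_dump_flag || !dev->ops->get_dump_data)
		return -EOPNOTSUPP;

	ret = dev->ops->get_dump_flag(dev, dump);
	if (ret)
		return ret;
	if (dump->len > buflen)
		return -ENOSPC;
	return dev->ops->get_dump_data(dev, dump, buf);
}

/* Example driver side: each hook does exactly one job. */
static int drv_get_flag(struct fake_dev *dev, struct fake_dump *dump)
{
	dump->version = 1;
	dump->flag = dev->flag;
	dump->len = dev->blob_len;
	return 0;
}

static int drv_get_data(struct fake_dev *dev, struct fake_dump *dump,
			void *buffer)
{
	memcpy(buffer, dev->blob, dev->blob_len);
	dump->len = dev->blob_len;
	return 0;
}
```

With separate hooks the driver needs no branch on an operation type, and the
core never has to call a data hook with a NULL buffer just to learn the length.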

> Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
> ---
>  include/linux/ethtool.h |   31 +++++++++++++++++++++++
>  net/core/ethtool.c      |   62 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 93 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index bd0b50b..f2cd7e1 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -601,6 +601,31 @@ struct ethtool_flash {
>  	char	data[ETHTOOL_FLASH_MAX_FILENAME];
>  };
>  
> +/**
> + * struct ethtool_dump - used for retrieving, setting device dump

Missing description of @cmd.

> + * @type: type of operation, get dump settings or data
> + * @version: FW version of the dump
> + * @flag: flag for dump setting
> + * @len: length of dump data
> + * @data: data collected for this command

The kernel-doc needs to describe when the fields are valid - i.e. which
commands use them as input and/or output.

> + */
> +struct ethtool_dump {
> +	__u32	cmd;
> +	__u32	type;
> +	__u32	version;
> +	__u32	flag;
> +	__u32	len;
> +	u8	data[0];
> +};
> +
> +/*
> + * ethtool_dump_op_type - used to determine type of fetch, flag or data
> + */
> +enum ethtool_dump_op_type {
> +	ETHTOOL_DUMP_FLAG	= 0,
> +	ETHTOOL_DUMP_DATA,
> +};
> +
>  /* for returning and changing feature sets */
>  
>  /**
> @@ -853,6 +878,8 @@ bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
>   * @get_channels: Get number of channels.
>   * @set_channels: Set number of channels.  Returns a negative error code or
>   *	zero.
> + * @get_dump: Get dump flag indicating current dump settings of the device
> + * @set_dump: Set dump specific flags to the device
>   *
>   * All operations are optional (i.e. the function pointer may be set
>   * to %NULL) and callers must take this into account.  Callers must
> @@ -927,6 +954,8 @@ struct ethtool_ops {
>  				  const struct ethtool_rxfh_indir *);
>  	void	(*get_channels)(struct net_device *, struct ethtool_channels *);
>  	int	(*set_channels)(struct net_device *, struct ethtool_channels *);
> +	int	(*get_dump)(struct net_device *, struct ethtool_dump *, void *);
> +	int	(*set_dump)(struct net_device *, struct ethtool_dump *);
>  
>  };
>  #endif /* __KERNEL__ */
> @@ -998,6 +1027,8 @@ struct ethtool_ops {
>  #define ETHTOOL_SFEATURES	0x0000003b /* Change device offload settings */
>  #define ETHTOOL_GCHANNELS	0x0000003c /* Get no of channels */
>  #define ETHTOOL_SCHANNELS	0x0000003d /* Set no of channels */
> +#define ETHTOOL_SET_DUMP	0x0000003e /* Set dump settings */
> +#define ETHTOOL_GET_DUMP	0x0000003f /* Get dump settings */
>  
>  /* compatibility with older code */
>  #define SPARC_ETH_GSET		ETHTOOL_GSET
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index b6f4058..3c3af8b 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1823,6 +1823,62 @@ static noinline_for_stack int ethtool_flash_device(struct net_device *dev,
>  	return dev->ethtool_ops->flash_device(dev, &efl);
>  }
>  
> +static int ethtool_set_dump(struct net_device *dev,
> +			void __user *useraddr)
> +{
> +	struct ethtool_dump dump;
> +
> +	if (!dev->ethtool_ops->set_dump)
> +		return -EOPNOTSUPP;
> +
> +	if (copy_from_user(&dump, useraddr, sizeof(dump)))
> +		return -EFAULT;
> +
> +	return dev->ethtool_ops->set_dump(dev, &dump);
> +}
> +
> +static int ethtool_get_dump(struct net_device *dev,
> +				void __user *useraddr)
> +{
> +	int ret;
> +	void *data = NULL;
> +	struct ethtool_dump dump;
> +	const struct ethtool_ops *ops = dev->ethtool_ops;
> +	enum ethtool_dump_op_type type;
> +
> +	if (!dev->ethtool_ops->get_dump)
> +		return -EOPNOTSUPP;
> +
> +	if (copy_from_user(&dump, useraddr, sizeof(dump)))
> +		return -EFAULT;
> +
> +	type = dump.type;
> +	dump.type = ETHTOOL_DUMP_FLAG;
> +	ret = ops->get_dump(dev, &dump, data);
> +	if (ret)
> +		return ret;
> +	dump.type = type;
> +	if (copy_to_user(useraddr, &dump, sizeof(dump)))
> +		return -EFAULT;
> +	if (type != ETHTOOL_DUMP_DATA)
> +		return 0;
> +
> +	data = vzalloc(dump.len);
> +	if (!data)
> +		return -ENOMEM;
> +	ret = ops->get_dump(dev, &dump, data);
> +	if (ret) {
> +		ret = -EFAULT;

There is no reason to change ret here.  Didn't I already raise this in
version 1?

> +		goto out;
> +	}
> +	useraddr += offsetof(struct ethtool_dump, data);
> +	if (copy_to_user(useraddr, data, dump.len))

The copied length should be the *minimum* of the user's buffer length
and the actual dump length.

Ben.

> +		ret = -EFAULT;
> +out:
> +	vfree(data);
> +	return ret;
> +}
> +
>  /* The main entry point in this file.  Called from net/core/dev.c */
>  
>  int dev_ethtool(struct net *net, struct ifreq *ifr)
> @@ -2039,6 +2095,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
>  	case ETHTOOL_SCHANNELS:
>  		rc = ethtool_set_channels(dev, useraddr);
>  		break;
> +	case ETHTOOL_SET_DUMP:
> +		rc = ethtool_set_dump(dev, useraddr);
> +		break;
> +	case ETHTOOL_GET_DUMP:
> +		rc = ethtool_get_dump(dev, useraddr);
> +		break;
>  	default:
>  		rc = -EOPNOTSUPP;
>  	}
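
A minimal userspace sketch of the length clamp suggested above.  It
assumes the length supplied by the user is saved before the first
get_dump call overwrites dump.len.  The names dump_copy_len() and
copy_dump() are hypothetical, and memcpy() stands in for copy_to_user();
none of this is part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of bytes that may safely be copied back. */
static uint32_t dump_copy_len(uint32_t user_len, uint32_t dump_len)
{
	return user_len < dump_len ? user_len : dump_len;
}

/*
 * Userspace stand-in for the copy_to_user() step: copy at most
 * user_len bytes of the dump into the caller's buffer and return
 * the number of bytes actually copied.
 */
static uint32_t copy_dump(void *user_buf, uint32_t user_len,
			  const void *dump, uint32_t dump_len)
{
	uint32_t n = dump_copy_len(user_len, dump_len);

	assert(user_buf != NULL && dump != NULL);
	memcpy(user_buf, dump, n);
	return n;
}
```

With a 4-byte user buffer and an 8-byte dump, only the first 4 bytes are
copied, so the caller's buffer is never overrun.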

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <1305050754.2758.12.camel@edumazet-laptop>

On Tue, 10 May 2011, Eric Dumazet wrote:

> > +	else
> > +		p = get_freepointer(s, object);
> > +
> > +	local_irq_restore(flags);
> > +#else
> > +	p = get_freepointer(s, object);
> > +#endif
> > +	return p;
> > +}
> > +
> >  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
> >  {
> >  	*(void **)(object + s->offset) = fp;
> > @@ -1933,7 +1954,7 @@ redo:
> >  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
> >  				s->cpu_slab->freelist, s->cpu_slab->tid,
> >  				object, tid,
> > -				get_freepointer(s, object), next_tid(tid)))) {
> > +				get_freepointer_safe(s, object), next_tid(tid)))) {
> >
> >  			note_cmpxchg_failure("slab_alloc", s, tid);
> >  			goto redo;
>
>
> Really this wont work Stephen

I am not Stephen.

> You have to disable IRQ _before_ even fetching 'object'

The object pointer is being obtained from a per cpu structure and
not from the page. What is the problem with fetching the object pointer?

> Or else, you can have an IRQ, allocate this object, pass to another cpu.

If that occurs then TID is being incremented and we will restart the loop
getting the new object pointer from the per cpu structure. The object
pointer that we were considering is irrelevant.

> This other cpu can free the object and unmap page right after you did
> the probe_kernel_address(object) (successfully), and before your cpu :
>
> p = get_freepointer(s, object); << BUG >>

If the other cpu frees the object and unmaps the page then
get_freepointer_safe() can obtain an arbitrary value since the TID was
incremented. We will restart the loop and discard the value retrieved.



* [PATCH v2] net: ipv4: add IPPROTO_ICMP socket kind
From: Vasiliy Kulikov @ 2011-05-10 18:09 UTC (permalink / raw)
  To: David Miller
  Cc: solar, linux-kernel, netdev, peak, kees.cook, dan.j.rosenberg,
	eugene, nelhage, kuznet, pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20110412.142534.183049889.davem@davemloft.net>

This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges.  In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
order not to increase the kernel's attack surface (in case of
vulnerabilities in the newly added code), the new functionality is
disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)

Message identifiers (octets 4-5 of the ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes; port 0 is reserved for the API ("let the
kernel pick a free number"). There is no notion of remote ports; remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate, in order to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, and the
checksum is always recomputed.
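
A minimal userspace sketch of what a caller might hand to send() on such
a socket.  The struct echo_hdr layout and the helpers inet_checksum(),
build_echo() and open_ping_socket() are illustrative names, not part of
the patch.  The checksum is computed here only to show the RFC 1071
arithmetic; per the description above, the kernel recomputes it and
overwrites the id anyway.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define ECHO_REQUEST 8	/* value of ICMP_ECHO */

/* Illustrative layout of the 8-byte ICMP echo header. */
struct echo_hdr {
	uint8_t  type;
	uint8_t  code;
	uint16_t checksum;
	uint16_t id;		/* replaced by the kernel with the local port */
	uint16_t sequence;
};

/* RFC 1071 Internet checksum; the result can be stored in the header as-is. */
static uint16_t inet_checksum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	for (; len > 1; p += 2, len -= 2)
		sum += (uint32_t)p[0] << 8 | p[1];
	if (len)
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return htons((uint16_t)~sum);
}

/* Build header + payload in buf; returns the packet length, 0 if too small. */
static size_t build_echo(uint8_t *buf, size_t buflen, uint16_t seq,
			 const void *payload, size_t paylen)
{
	struct echo_hdr hdr = { .type = ECHO_REQUEST, .sequence = htons(seq) };
	size_t total = sizeof(hdr) + paylen;

	assert(buf != NULL);
	if (total > buflen)
		return 0;
	memcpy(buf, &hdr, sizeof(hdr));
	if (paylen)
		memcpy(buf + sizeof(hdr), payload, paylen);
	hdr.checksum = inet_checksum(buf, total);
	memcpy(buf, &hdr, sizeof(hdr));
	return total;
}

/* Returns a ping socket, or -1 (e.g. when the sysctl excludes the caller). */
static int open_ping_socket(void)
{
	return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}
```

A packet built by build_echo() would then be passed to sendto() on the
descriptor returned by open_ping_socket(), with the destination given in
a struct sockaddr_in.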

ICMP reply packets received from the network are demultiplexed according
to their ids, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets.  Setting it to "100
100" would grant permission to a single group (either to make /sbin/ping
setgid and owned by this group, or to grant permission to a "netadmins"
group); "0 4294967295" would enable it for everyone; "100 4294967295"
would enable it for users, but not daemons.
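
The range semantics described above can be sketched in userspace as
follows.  The struct gid_range type and both function names are
hypothetical; the sketch assumes the range is inclusive and that a caller
is allowed if its effective gid or any supplementary gid falls inside it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inclusive range as written to ping_group_range; low > high is empty. */
struct gid_range {
	unsigned int low;
	unsigned int high;
};

static bool gid_in_range(const struct gid_range *r, unsigned int gid)
{
	return r->low <= gid && gid <= r->high;
}

/* Allow if the effective gid or any supplementary gid is in range. */
static bool may_create_ping_socket(const struct gid_range *r,
				   unsigned int egid,
				   const unsigned int *groups, size_t ngroups)
{
	size_t i;

	assert(r != NULL);
	if (gid_in_range(r, egid))
		return true;
	for (i = 0; i < ngroups; i++)
		if (gid_in_range(r, groups[i]))
			return true;
	return false;
}
```

With the default "1 0" no gid can satisfy low <= gid <= high, which is
why even root is refused.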

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) have been tested
with the patch.

PATCH v2:
    - changed ping_debug() to pr_debug().
    - removed CONFIG_IP_PING.
    - removed ping_seq_fops.owner field (unused for procfs).
    - switched to proc_net_fops_create().
    - switched to %pK.

PATCH v1:
    - fixed checksumming bug.
    - CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
    - minor cleanups.
    - introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Solar Designer <solar@openwall.com>
---
 include/net/netns/ipv4.h   |    2 +
 include/net/ping.h         |   57 +++
 net/ipv4/Makefile          |    2 +-
 net/ipv4/af_inet.c         |   22 +
 net/ipv4/icmp.c            |   12 +-
 net/ipv4/ping.c            |  929 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c |   80 ++++
 7 files changed, 1102 insertions(+), 2 deletions(-)
 create mode 100644 include/net/ping.h
 create mode 100644 net/ipv4/ping.c

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d68c3f1..ff3bb61 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,8 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
+	unsigned int sysctl_ping_group_range[2];
+
 	atomic_t rt_genid;
 
 #ifdef CONFIG_IP_MROUTE
diff --git a/include/net/ping.h b/include/net/ping.h
new file mode 100644
index 0000000..23062c3
--- /dev/null
+++ b/include/net/ping.h
@@ -0,0 +1,57 @@
+/*
+ * INET		An implementation of the TCP/IP protocol suite for the LINUX
+ *		operating system.  INET is implemented using the  BSD Socket
+ *		interface as the means of communication with the user level.
+ *
+ *		Definitions for the "ping" module.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ */
+#ifndef _PING_H
+#define _PING_H
+
+#include <net/netns/hash.h>
+
+/* PING_HTABLE_SIZE must be power of 2 */
+#define PING_HTABLE_SIZE 	64
+#define PING_HTABLE_MASK 	(PING_HTABLE_SIZE-1)
+
+#define ping_portaddr_for_each_entry(__sk, node, list) \
+	hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
+
+/*
+ * gid_t is either uint or ushort.  We want to pass it to
+ * proc_dointvec_minmax(), so it must not be larger than MAX_INT
+ */
+#define GID_T_MAX (((gid_t)~0U) >> 1)
+
+struct ping_table {
+	struct hlist_nulls_head	hash[PING_HTABLE_SIZE];
+	rwlock_t		lock;
+};
+
+struct ping_iter_state {
+	struct seq_net_private  p;
+	int			bucket;
+};
+
+extern struct proto ping_prot;
+
+
+extern void ping_rcv(struct sk_buff *);
+extern void ping_err(struct sk_buff *, u32 info);
+
+extern void inet_get_ping_group_range_net(struct net *net, unsigned int *low, unsigned int *high);
+
+#ifdef CONFIG_PROC_FS
+extern int __init ping_proc_init(void);
+extern void ping_proc_exit(void);
+#endif
+
+void __init ping_init(void);
+
+
+#endif /* _PING_H */
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 4978d22..01b0349 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -11,7 +11,7 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     datagram.o raw.o udp.o udplite.o \
 	     arp.o icmp.o devinet.o af_inet.o  igmp.o \
 	     fib_frontend.o fib_semantics.o \
-	     inet_fragment.o
+	     inet_fragment.o ping.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
 obj-$(CONFIG_IP_FIB_HASH) += fib_hash.o
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 45b89d7..d2b225e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -105,6 +105,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <net/udplite.h>
+#include <net/ping.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <net/raw.h>
@@ -1008,6 +1009,14 @@ static struct inet_protosw inetsw_array[] =
 		.flags =      INET_PROTOSW_PERMANENT,
        },
 
+       {
+		.type =       SOCK_DGRAM,
+		.protocol =   IPPROTO_ICMP,
+		.prot =       &ping_prot,
+		.ops =        &inet_dgram_ops,
+		.no_check =   UDP_CSUM_DEFAULT,
+		.flags =      INET_PROTOSW_REUSE,
+       },
 
        {
 	       .type =       SOCK_RAW,
@@ -1528,6 +1537,7 @@ static const struct net_protocol udp_protocol = {
 
 static const struct net_protocol icmp_protocol = {
 	.handler =	icmp_rcv,
+	.err_handler =	ping_err,
 	.no_policy =	1,
 	.netns_ok =	1,
 };
@@ -1643,6 +1653,10 @@ static int __init inet_init(void)
 	if (rc)
 		goto out_unregister_udp_proto;
 
+	rc = proto_register(&ping_prot, 1);
+	if (rc)
+		goto out_unregister_raw_proto;
+
 	/*
 	 *	Tell SOCKET that we are alive...
 	 */
@@ -1698,6 +1712,8 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+	ping_init();
+
 	/*
 	 *	Set the ICMP layer up
 	 */
@@ -1728,6 +1744,8 @@ static int __init inet_init(void)
 	rc = 0;
 out:
 	return rc;
+out_unregister_raw_proto:
+	proto_unregister(&raw_prot);
 out_unregister_udp_proto:
 	proto_unregister(&udp_prot);
 out_unregister_tcp_proto:
@@ -1752,11 +1770,15 @@ static int __init ipv4_proc_init(void)
 		goto out_tcp;
 	if (udp4_proc_init())
 		goto out_udp;
+	if (ping_proc_init())
+		goto out_ping;
 	if (ip_misc_proc_init())
 		goto out_misc;
 out:
 	return rc;
 out_misc:
+	ping_proc_exit();
+out_ping:
 	udp4_proc_exit();
 out_udp:
 	tcp4_proc_exit();
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 4aa1b7f..51e5c41 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -83,6 +83,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <net/raw.h>
+#include <net/ping.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <linux/errno.h>
@@ -798,6 +799,15 @@ static void icmp_redirect(struct sk_buff *skb)
 			       iph->saddr, skb->dev);
 		break;
 	}
+
+	/* Ping wants to see redirects.
+         * Let's pretend they are errors of sorts... */
+	if (iph->protocol == IPPROTO_ICMP &&
+	    iph->ihl >= 5 &&
+	    pskb_may_pull(skb, (iph->ihl<<2)+8)) {
+		ping_err(skb, icmp_hdr(skb)->un.gateway);
+	}
+
 out:
 	return;
 out_err:
@@ -1058,7 +1068,7 @@ error:
  */
 static const struct icmp_control icmp_pointers[NR_ICMP_TYPES + 1] = {
 	[ICMP_ECHOREPLY] = {
-		.handler = icmp_discard,
+		.handler = ping_rcv,
 	},
 	[1] = {
 		.handler = icmp_discard,
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
new file mode 100644
index 0000000..e81ec6c
--- /dev/null
+++ b/net/ipv4/ping.c
@@ -0,0 +1,929 @@
+/*
+ * INET		An implementation of the TCP/IP protocol suite for the LINUX
+ *		operating system.  INET is implemented using the  BSD Socket
+ *		interface as the means of communication with the user level.
+ *
+ *		"Ping" sockets
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Based on ipv4/udp.c code.
+ *
+ * Authors:	Vasiliy Kulikov / Openwall (for Linux 2.6),
+ *		Pavel Kankovsky (for Linux 2.4.32)
+ *
+ * Pavel gave all rights to bugs to Vasiliy,
+ * none of the bugs are Pavel's now.
+ *
+ */
+
+#include <asm/system.h>
+#include <linux/uaccess.h>
+#include <asm/ioctls.h>
+#include <linux/types.h>
+#include <linux/fcntl.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/timer.h>
+#include <linux/mm.h>
+#include <linux/inet.h>
+#include <linux/netdevice.h>
+#include <net/snmp.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/icmp.h>
+#include <net/protocol.h>
+#include <linux/skbuff.h>
+#include <linux/proc_fs.h>
+#include <net/sock.h>
+#include <net/ping.h>
+#include <net/icmp.h>
+#include <net/udp.h>
+#include <net/route.h>
+#include <net/inet_common.h>
+#include <net/checksum.h>
+
+
+struct ping_table ping_table __read_mostly;
+
+u16 ping_port_rover;
+
+static inline int ping_hashfn(struct net *net, unsigned num, unsigned mask)
+{
+	int res = (num + net_hash_mix(net)) & mask;
+	pr_debug("hash(%d) = %d\n", num, res);
+	return res;
+}
+
+static inline struct hlist_nulls_head *ping_hashslot(struct ping_table *table,
+					     struct net *net, unsigned num)
+{
+	return &table->hash[ping_hashfn(net, num, PING_HTABLE_MASK)];
+}
+
+static int ping_v4_get_port(struct sock *sk, unsigned short ident)
+{
+	struct hlist_nulls_node *node;
+	struct hlist_nulls_head *hlist;
+	struct inet_sock *isk, *isk2;
+	struct sock *sk2 = NULL;
+
+	isk = inet_sk(sk);
+	write_lock_bh(&ping_table.lock);
+	if (ident == 0) {
+		u32 i;
+		u16 result = ping_port_rover + 1;
+
+		for (i = 0; i < (1L << 16); i++, result++) {
+			if (!result)
+				result++; /* avoid zero */
+			hlist = ping_hashslot(&ping_table, sock_net(sk),
+					    result);
+			ping_portaddr_for_each_entry(sk2, node, hlist) {
+				isk2 = inet_sk(sk2);
+
+				if (isk2->inet_num == result)
+					goto next_port;
+			}
+
+			/* found */
+			ping_port_rover = ident = result;
+			break;
+next_port:
+			;
+		}
+		if (i >= (1L << 16))
+			goto fail;
+	} else {
+		hlist = ping_hashslot(&ping_table, sock_net(sk), ident);
+		ping_portaddr_for_each_entry(sk2, node, hlist) {
+			isk2 = inet_sk(sk2);
+
+			if ((isk2->inet_num == ident) &&
+			    (sk2 != sk) &&
+			    (!sk2->sk_reuse || !sk->sk_reuse))
+				goto fail;
+		}
+	}
+
+	pr_debug("found port/ident = %d\n", ident);
+	isk->inet_num = ident;
+	if (sk_unhashed(sk)) {
+		pr_debug("was not hashed\n");
+		sock_hold(sk);
+		hlist_nulls_add_head(&sk->sk_nulls_node, hlist);
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+	}
+	write_unlock_bh(&ping_table.lock);
+	return 0;
+
+fail:
+	write_unlock_bh(&ping_table.lock);
+	return 1;
+}
+
+static void ping_v4_hash(struct sock *sk)
+{
+	pr_debug("ping_v4_hash(sk->port=%u)\n", inet_sk(sk)->inet_num);
+	BUG(); /* "Please do not press this button again." */
+}
+
+static void ping_v4_unhash(struct sock *sk)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	pr_debug("ping_v4_unhash(isk=%p,isk->num=%u)\n", isk, isk->inet_num);
+	if (sk_hashed(sk)) {
+		struct hlist_nulls_head *hslot;
+
+		hslot = ping_hashslot(&ping_table, sock_net(sk), isk->inet_num);
+		write_lock_bh(&ping_table.lock);
+		hlist_nulls_del(&sk->sk_nulls_node);
+		sock_put(sk);
+		isk->inet_num = isk->inet_sport = 0;
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+		write_unlock_bh(&ping_table.lock);
+	}
+}
+
+struct sock *ping_v4_lookup(struct net *net, u32 saddr, u32 daddr,
+	 u16 ident, int dif)
+{
+	struct hlist_nulls_head *hslot = ping_hashslot(&ping_table, net, ident);
+	struct sock *sk = NULL;
+	struct inet_sock *isk;
+	struct hlist_nulls_node *hnode;
+
+	pr_debug("try to find: num = %d, daddr = %ld, dif = %d\n",
+			 (int)ident, (unsigned long)daddr, dif);
+	read_lock_bh(&ping_table.lock);
+
+	ping_portaddr_for_each_entry(sk, hnode, hslot) {
+		isk = inet_sk(sk);
+
+		pr_debug("found: %p: num = %d, daddr = %ld, dif = %d\n", sk,
+			 (int)isk->inet_num, (unsigned long)isk->inet_rcv_saddr,
+			 sk->sk_bound_dev_if);
+
+		pr_debug("iterate\n");
+		if (isk->inet_num != ident)
+			continue;
+		if (isk->inet_rcv_saddr && isk->inet_rcv_saddr != daddr)
+			continue;
+		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
+			continue;
+
+		sock_hold(sk);
+		goto exit;
+	}
+
+	sk = NULL;
+exit:
+	read_unlock_bh(&ping_table.lock);
+
+	return sk;
+}
+
+static int ping_init_sock(struct sock *sk)
+{
+	struct net *net = sock_net(sk);
+	gid_t group = current_egid();
+	gid_t range[2];
+	struct group_info *group_info = get_current_groups();
+	int i, j, count = group_info->ngroups;
+
+	inet_get_ping_group_range_net(net, range, range+1);
+	if (range[0] <= group && group <= range[1])
+		return 0;
+
+	for (i = 0; i < group_info->nblocks; i++) {
+		int cp_count = min_t(int, NGROUPS_PER_BLOCK, count);
+
+		for (j = 0; j < cp_count; j++) {
+			group = group_info->blocks[i][j];
+			if (range[0] <= group && group <= range[1])
+				return 0;
+		}
+
+		count -= cp_count;
+	}
+
+	return -EACCES;
+}
+
+static void ping_close(struct sock *sk, long timeout)
+{
+	pr_debug("ping_close(sk=%p,sk->num=%u)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num);
+	pr_debug("isk->refcnt = %d\n", sk->sk_refcnt.counter);
+
+	sk_common_release(sk);
+}
+
+/*
+ * We need our own bind because there are no privileged id's == local ports.
+ * Moreover, we don't allow binding to multi- and broadcast addresses.
+ */
+
+static int ping_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *isk = inet_sk(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	int err;
+
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	pr_debug("ping_v4_bind(sk=%p,sa_addr=%08x,sa_port=%d)\n",
+		sk, addr->sin_addr.s_addr, ntohs(addr->sin_port));
+
+	chk_addr_ret = inet_addr_type(sock_net(sk), addr->sin_addr.s_addr);
+	if (addr->sin_addr.s_addr == INADDR_ANY)
+		chk_addr_ret = RTN_LOCAL;
+
+	if ((sysctl_ip_nonlocal_bind == 0 &&
+	    isk->freebind == 0 && isk->transparent == 0 &&
+	     chk_addr_ret != RTN_LOCAL) ||
+	    chk_addr_ret == RTN_MULTICAST ||
+	    chk_addr_ret == RTN_BROADCAST)
+		return -EADDRNOTAVAIL;
+
+	lock_sock(sk);
+
+	err = -EINVAL;
+	if (isk->inet_num != 0)
+		goto out;
+
+	err = -EADDRINUSE;
+	isk->inet_rcv_saddr = isk->inet_saddr = addr->sin_addr.s_addr;
+	snum = ntohs(addr->sin_port);
+	if (ping_v4_get_port(sk, snum) != 0) {
+		isk->inet_saddr = isk->inet_rcv_saddr = 0;
+		goto out;
+	}
+
+	pr_debug("after bind(): num = %d, daddr = %ld, dif = %d\n",
+		(int)isk->inet_num,
+		(unsigned long) isk->inet_rcv_saddr,
+		(int)sk->sk_bound_dev_if);
+
+	err = 0;
+	if (isk->inet_rcv_saddr)
+		sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
+	if (snum)
+		sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
+	isk->inet_sport = htons(isk->inet_num);
+	isk->inet_daddr = 0;
+	isk->inet_dport = 0;
+	sk_dst_reset(sk);
+out:
+	release_sock(sk);
+	pr_debug("ping_v4_bind -> %d\n", err);
+	return err;
+}
+
+/*
+ * Is this a supported type of ICMP message?
+ */
+
+static inline int ping_supported(int type, int code)
+{
+	if (type == ICMP_ECHO && code == 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * This routine is called by the ICMP module when it gets some
+ * sort of error condition.
+ */
+
+static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+
+void ping_err(struct sk_buff *skb, u32 info)
+{
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	struct icmphdr *icmph = (struct icmphdr *)(skb->data+(iph->ihl<<2));
+	struct inet_sock *inet_sock;
+	int type = icmph->type;
+	int code = icmph->code;
+	struct net *net = dev_net(skb->dev);
+	struct sock *sk;
+	int harderr;
+	int err;
+
+	/* We assume the packet has already been checked by icmp_unreach */
+
+	if (!ping_supported(icmph->type, icmph->code))
+		return;
+
+	pr_debug("ping_err(type=%04x,code=%04x,id=%04x,seq=%04x)\n", type,
+		code, ntohs(icmph->un.echo.id), ntohs(icmph->un.echo.sequence));
+
+	sk = ping_v4_lookup(net, iph->daddr, iph->saddr,
+			    ntohs(icmph->un.echo.id), skb->dev->ifindex);
+	if (sk == NULL) {
+		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
+		pr_debug("no socket, dropping\n");
+		return;	/* No socket for error */
+	}
+	pr_debug("err on socket %p\n", sk);
+
+	err = 0;
+	harderr = 0;
+	inet_sock = inet_sk(sk);
+
+	switch (type) {
+	default:
+	case ICMP_TIME_EXCEEDED:
+		err = EHOSTUNREACH;
+		break;
+	case ICMP_SOURCE_QUENCH:
+		/* This is not a real error but ping wants to see it.
+		 * Report it with some fake errno. */
+		err = EREMOTEIO;
+		break;
+	case ICMP_PARAMETERPROB:
+		err = EPROTO;
+		harderr = 1;
+		break;
+	case ICMP_DEST_UNREACH:
+		if (code == ICMP_FRAG_NEEDED) { /* Path MTU discovery */
+			if (inet_sock->pmtudisc != IP_PMTUDISC_DONT) {
+				err = EMSGSIZE;
+				harderr = 1;
+				break;
+			}
+			goto out;
+		}
+		err = EHOSTUNREACH;
+		if (code <= NR_ICMP_UNREACH) {
+			harderr = icmp_err_convert[code].fatal;
+			err = icmp_err_convert[code].errno;
+		}
+		break;
+	case ICMP_REDIRECT:
+		/* See ICMP_SOURCE_QUENCH */
+		err = EREMOTEIO;
+		break;
+	}
+
+	/*
+	 *      RFC1122: OK.  Passes ICMP errors back to application, as per
+	 *	4.1.3.3.
+	 */
+	if (!inet_sock->recverr) {
+		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
+			goto out;
+	} else {
+		ip_icmp_error(sk, skb, err, 0 /* no remote port */,
+			 info, (u8 *)icmph);
+	}
+	sk->sk_err = err;
+	sk->sk_error_report(sk);
+out:
+	sock_put(sk);
+}
+
+/*
+ *	Copy and checksum an ICMP Echo packet from user space into a buffer.
+ */
+
+struct pingfakehdr {
+	struct icmphdr icmph;
+	struct iovec *iov;
+	u32 wcheck;
+};
+
+static int ping_getfrag(void *from, char * to,
+			int offset, int fraglen, int odd, struct sk_buff *skb)
+{
+	struct pingfakehdr *pfh = (struct pingfakehdr *)from;
+
+	if (offset == 0) {
+		if (fraglen < sizeof(struct icmphdr))
+			BUG();
+		if (csum_partial_copy_fromiovecend(to + sizeof(struct icmphdr),
+			    pfh->iov, 0, fraglen - sizeof(struct icmphdr),
+			    &pfh->wcheck))
+			return -EFAULT;
+
+		return 0;
+	}
+	if (offset < sizeof(struct icmphdr))
+		BUG();
+	if (csum_partial_copy_fromiovecend
+			(to, pfh->iov, offset - sizeof(struct icmphdr),
+			 fraglen, &pfh->wcheck))
+		return -EFAULT;
+	return 0;
+}
+
+static int ping_push_pending_frames(struct sock *sk, struct pingfakehdr *pfh)
+{
+	struct sk_buff *skb = skb_peek(&sk->sk_write_queue);
+
+	pfh->wcheck = csum_partial((char *)&pfh->icmph,
+		sizeof(struct icmphdr), pfh->wcheck);
+	pfh->icmph.checksum = csum_fold(pfh->wcheck);
+	memcpy(icmp_hdr(skb), &pfh->icmph, sizeof(struct icmphdr));
+	skb->ip_summed = CHECKSUM_NONE;
+	return ip_push_pending_frames(sk);
+}
+
+int ping_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+		 size_t len)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	struct ipcm_cookie ipc;
+	struct icmphdr user_icmph;
+	struct pingfakehdr pfh;
+	struct rtable *rt = NULL;
+	int free = 0;
+	u32 saddr, daddr;
+	u8  tos;
+	int err;
+
+	pr_debug("ping_sendmsg(sk=%p,sk->num=%u)\n", isk, isk->inet_num);
+
+
+	if (len > 0xFFFF)
+		return -EMSGSIZE;
+
+	/*
+	 *	Check the flags.
+	 */
+
+	/* Mirror BSD error message compatibility */
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	/*
+	 *	Fetch the ICMP header provided by the userland.
+	 *	iovec is modified!
+	 */
+
+	if (memcpy_fromiovec((u8 *)&user_icmph, msg->msg_iov,
+			     sizeof(struct icmphdr)))
+		return -EFAULT;
+	if (!ping_supported(user_icmph.type, user_icmph.code))
+		return -EINVAL;
+
+	/*
+	 *	Get and verify the address.
+	 */
+
+	if (msg->msg_name) {
+		struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
+		if (msg->msg_namelen < sizeof(*usin))
+			return -EINVAL;
+		if (usin->sin_family != AF_INET)
+			return -EINVAL;
+		daddr = usin->sin_addr.s_addr;
+		/* no remote port */
+	} else {
+		if (sk->sk_state != TCP_ESTABLISHED)
+			return -EDESTADDRREQ;
+		daddr = isk->inet_daddr;
+		/* no remote port */
+	}
+
+	ipc.addr = isk->inet_saddr;
+	ipc.opt = NULL;
+	ipc.oif = sk->sk_bound_dev_if;
+
+	if (msg->msg_controllen) {
+		err = ip_cmsg_send(sock_net(sk), msg, &ipc);
+		if (err)
+			return err;
+		if (ipc.opt)
+			free = 1;
+	}
+	if (!ipc.opt)
+		ipc.opt = isk->opt;
+
+	saddr = ipc.addr;
+	ipc.addr = daddr;
+
+	if (ipc.opt && ipc.opt->srr) {
+		if (!daddr)
+			return -EINVAL;
+		daddr = ipc.opt->faddr;
+	}
+	tos = RT_TOS(isk->tos);
+	if (sock_flag(sk, SOCK_LOCALROUTE) ||
+	    (msg->msg_flags&MSG_DONTROUTE) ||
+	    (ipc.opt && ipc.opt->is_strictroute)) {
+		tos |= RTO_ONLINK;
+	}
+
+	if (ipv4_is_multicast(daddr)) {
+		if (!ipc.oif)
+			ipc.oif = isk->mc_index;
+		if (!saddr)
+			saddr = isk->mc_addr;
+	}
+
+	{
+		struct flowi fl = { .oif = ipc.oif,
+				    .mark = sk->sk_mark,
+				    .nl_u = { .ip4_u = {
+						.daddr = daddr,
+						.saddr = saddr,
+						.tos = tos } },
+				    .proto = IPPROTO_ICMP,
+				    .flags = inet_sk_flowi_flags(sk),
+		};
+
+		struct net *net = sock_net(sk);
+
+		security_sk_classify_flow(sk, &fl);
+		err = ip_route_output_flow(net, &rt, &fl, sk, 1);
+		if (err) {
+			if (err == -ENETUNREACH)
+				IP_INC_STATS_BH(net, IPSTATS_MIB_OUTNOROUTES);
+			goto out;
+		}
+
+		err = -EACCES;
+		if ((rt->rt_flags & RTCF_BROADCAST) &&
+		    !sock_flag(sk, SOCK_BROADCAST))
+			goto out;
+	}
+
+	if (msg->msg_flags & MSG_CONFIRM)
+		goto do_confirm;
+back_from_confirm:
+
+	if (!ipc.addr)
+		ipc.addr = rt->rt_dst;
+
+	lock_sock(sk);
+
+	pfh.icmph.type = user_icmph.type; /* already checked */
+	pfh.icmph.code = user_icmph.code; /* dtto */
+	pfh.icmph.checksum = 0;
+	pfh.icmph.un.echo.id = isk->inet_sport;
+	pfh.icmph.un.echo.sequence = user_icmph.un.echo.sequence;
+	pfh.iov = msg->msg_iov;
+	pfh.wcheck = 0;
+
+	err = ip_append_data(sk, ping_getfrag, &pfh, len,
+			0, &ipc, &rt,
+			msg->msg_flags);
+	if (err)
+		ip_flush_pending_frames(sk);
+	else
+		err = ping_push_pending_frames(sk, &pfh);
+	release_sock(sk);
+
+out:
+	ip_rt_put(rt);
+	if (free)
+		kfree(ipc.opt);
+	if (!err) {
+		icmp_out_count(sock_net(sk), user_icmph.type);
+		return len;
+	}
+	return err;
+
+do_confirm:
+	dst_confirm(&rt->dst);
+	if (!(msg->msg_flags & MSG_PROBE) || len)
+		goto back_from_confirm;
+	err = 0;
+	goto out;
+}
+
+/*
+ *	IOCTL requests applicable to the UDP^H^H^HICMP protocol
+ */
+
+int ping_ioctl(struct sock *sk, int cmd, unsigned long arg)
+{
+	pr_debug("ping_ioctl(sk=%p,sk->num=%u,cmd=%d,arg=%lu)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num, cmd, arg);
+	switch (cmd) {
+	case SIOCOUTQ:
+	case SIOCINQ:
+		return udp_ioctl(sk, cmd, arg);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+int ping_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+		 size_t len, int noblock, int flags, int *addr_len)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
+	struct sk_buff *skb;
+	int copied, err;
+
+	pr_debug("ping_recvmsg(sk=%p,sk->num=%u)\n", isk, isk->inet_num);
+
+	if (flags & MSG_OOB)
+		goto out;
+
+	if (addr_len)
+		*addr_len = sizeof(*sin);
+
+	if (flags & MSG_ERRQUEUE)
+		return ip_recv_error(sk, msg, len);
+
+	skb = skb_recv_datagram(sk, flags, noblock, &err);
+	if (!skb)
+		goto out;
+
+	copied = skb->len;
+	if (copied > len) {
+		msg->msg_flags |= MSG_TRUNC;
+		copied = len;
+	}
+
+	/* Don't bother checking the checksum */
+	err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+	if (err)
+		goto done;
+
+	sock_recv_timestamp(msg, sk, skb);
+
+	/* Copy the address. */
+	if (sin) {
+		sin->sin_family = AF_INET;
+		sin->sin_port = 0 /* skb->h.uh->source */;
+		sin->sin_addr.s_addr = ip_hdr(skb)->saddr;
+		memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+	}
+	if (isk->cmsg_flags)
+		ip_cmsg_recv(msg, skb);
+	err = copied;
+
+done:
+	skb_free_datagram(sk, skb);
+out:
+	pr_debug("ping_recvmsg -> %d\n", err);
+	return err;
+}
+
+static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	pr_debug("ping_queue_rcv_skb(sk=%p,sk->num=%d,skb=%p)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num, skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
+		ICMP_INC_STATS_BH(sock_net(sk), ICMP_MIB_INERRORS);
+		kfree_skb(skb);
+		pr_debug("ping_queue_rcv_skb -> failed\n");
+		return -1;
+	}
+	return 0;
+}
+
+
+/*
+ *	All we need to do is get the socket.
+ */
+
+void ping_rcv(struct sk_buff *skb)
+{
+	struct sock *sk;
+	struct net *net = dev_net(skb->dev);
+	struct iphdr *iph = ip_hdr(skb);
+	struct icmphdr *icmph = icmp_hdr(skb);
+	u32 saddr = iph->saddr;
+	u32 daddr = iph->daddr;
+
+	/* We assume the packet has already been checked by icmp_rcv */
+
+	pr_debug("ping_rcv(skb=%p,id=%04x,seq=%04x)\n",
+		skb, ntohs(icmph->un.echo.id), ntohs(icmph->un.echo.sequence));
+
+	/* Push ICMP header back */
+	skb_push(skb, skb->data - (u8 *)icmph);
+
+	sk = ping_v4_lookup(net, saddr, daddr, ntohs(icmph->un.echo.id),
+			    skb->dev->ifindex);
+	if (sk != NULL) {
+		pr_debug("rcv on socket %p\n", sk);
+		ping_queue_rcv_skb(sk, skb_get(skb));
+		sock_put(sk);
+		return;
+	}
+	pr_debug("no socket, dropping\n");
+
+	/* We're called from icmp_rcv(). kfree_skb() is done there. */
+}
+
+struct proto ping_prot = {
+	.name =		"PING",
+	.owner =	THIS_MODULE,
+	.init =		ping_init_sock,
+	.close =	ping_close,
+	.connect =	ip4_datagram_connect,
+	.disconnect =	udp_disconnect,
+	.ioctl =	ping_ioctl,
+	.setsockopt =	ip_setsockopt,
+	.getsockopt =	ip_getsockopt,
+	.sendmsg =	ping_sendmsg,
+	.recvmsg =	ping_recvmsg,
+	.bind =		ping_bind,
+	.backlog_rcv =	ping_queue_rcv_skb,
+	.hash =		ping_v4_hash,
+	.unhash =	ping_v4_unhash,
+	.get_port =	ping_v4_get_port,
+	.obj_size =	sizeof(struct inet_sock),
+};
+EXPORT_SYMBOL(ping_prot);
+
+#ifdef CONFIG_PROC_FS
+
+static struct sock *ping_get_first(struct seq_file *seq, int start)
+{
+	struct sock *sk;
+	struct ping_iter_state *state = seq->private;
+	struct net *net = seq_file_net(seq);
+
+	for (state->bucket = start; state->bucket < PING_HTABLE_SIZE;
+	     ++state->bucket) {
+		struct hlist_nulls_node *node;
+		struct hlist_nulls_head *hslot = &ping_table.hash[state->bucket];
+
+		if (hlist_nulls_empty(hslot))
+			continue;
+
+		sk_nulls_for_each(sk, node, hslot) {
+			if (net_eq(sock_net(sk), net))
+				goto found;
+		}
+	}
+	sk = NULL;
+found:
+	return sk;
+}
+
+static struct sock *ping_get_next(struct seq_file *seq, struct sock *sk)
+{
+	struct ping_iter_state *state = seq->private;
+	struct net *net = seq_file_net(seq);
+
+	do {
+		sk = sk_nulls_next(sk);
+	} while (sk && (!net_eq(sock_net(sk), net)));
+
+	if (!sk)
+		return ping_get_first(seq, state->bucket + 1);
+	return sk;
+}
+
+static struct sock *ping_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct sock *sk = ping_get_first(seq, 0);
+
+	if (sk)
+		while (pos && (sk = ping_get_next(seq, sk)) != NULL)
+			--pos;
+	return pos ? NULL : sk;
+}
+
+static void *ping_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct ping_iter_state *state = seq->private;
+	state->bucket = 0;
+
+	read_lock_bh(&ping_table.lock);
+
+	return *pos ? ping_get_idx(seq, *pos-1) : SEQ_START_TOKEN;
+}
+
+static void *ping_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct sock *sk;
+
+	if (v == SEQ_START_TOKEN)
+		sk = ping_get_idx(seq, 0);
+	else
+		sk = ping_get_next(seq, v);
+
+	++*pos;
+	return sk;
+}
+
+static void ping_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock_bh(&ping_table.lock);
+}
+
+static void ping_format_sock(struct sock *sp, struct seq_file *f,
+		int bucket, int *len)
+{
+	struct inet_sock *inet = inet_sk(sp);
+	__be32 dest = inet->inet_daddr;
+	__be32 src = inet->inet_rcv_saddr;
+	__u16 destp = ntohs(inet->inet_dport);
+	__u16 srcp = ntohs(inet->inet_sport);
+
+	seq_printf(f, "%5d: %08X:%04X %08X:%04X"
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %d%n",
+		bucket, src, srcp, dest, destp, sp->sk_state,
+		sk_wmem_alloc_get(sp),
+		sk_rmem_alloc_get(sp),
+		0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
+		atomic_read(&sp->sk_refcnt), sp,
+		atomic_read(&sp->sk_drops), len);
+}
+
+static int ping_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN)
+		seq_printf(seq, "%-127s\n",
+			   "  sl  local_address rem_address   st tx_queue "
+			   "rx_queue tr tm->when retrnsmt   uid  timeout "
+			   "inode ref pointer drops");
+	else {
+		struct ping_iter_state *state = seq->private;
+		int len;
+
+		ping_format_sock(v, seq, state->bucket, &len);
+		seq_printf(seq, "%*s\n", 127 - len, "");
+	}
+	return 0;
+}
+
+static const struct seq_operations ping_seq_ops = {
+	.show		= ping_seq_show,
+	.start		= ping_seq_start,
+	.next		= ping_seq_next,
+	.stop		= ping_seq_stop,
+};
+
+static int ping_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open_net(inode, file, &ping_seq_ops,
+			   sizeof(struct ping_iter_state));
+}
+
+static const struct file_operations ping_seq_fops = {
+	.open		= ping_seq_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_net,
+};
+
+static int ping_proc_register(struct net *net)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_net_fops_create(net, "icmp", S_IRUGO, &ping_seq_fops);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+
+static void ping_proc_unregister(struct net *net)
+{
+	proc_net_remove(net, "icmp");
+}
+
+
+static int __net_init ping_proc_init_net(struct net *net)
+{
+	return ping_proc_register(net);
+}
+
+static void __net_exit ping_proc_exit_net(struct net *net)
+{
+	ping_proc_unregister(net);
+}
+
+static struct pernet_operations ping_net_ops = {
+	.init = ping_proc_init_net,
+	.exit = ping_proc_exit_net,
+};
+
+int __init ping_proc_init(void)
+{
+	return register_pernet_subsys(&ping_net_ops);
+}
+
+void ping_proc_exit(void)
+{
+	unregister_pernet_subsys(&ping_net_ops);
+}
+
+#endif
+
+void __init ping_init(void)
+{
+	int i;
+
+	for (i = 0; i < PING_HTABLE_SIZE; i++)
+		INIT_HLIST_NULLS_HEAD(&ping_table.hash[i], i);
+	rwlock_init(&ping_table.lock);
+}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1a45665..c49403c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -13,6 +13,7 @@
 #include <linux/seqlock.h>
 #include <linux/init.h>
 #include <linux/slab.h>
+#include <linux/nsproxy.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -21,6 +22,7 @@
 #include <net/udp.h>
 #include <net/cipso_ipv4.h>
 #include <net/inet_frag.h>
+#include <net/ping.h>
 
 static int zero;
 static int tcp_retr1_max = 255;
@@ -30,6 +32,8 @@ static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
 static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
+static int ip_ping_group_range_min[] = { 0, 0 };
+static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 
 /* Update system visible IP port range */
 static void set_local_port_range(int range[2])
@@ -68,6 +72,65 @@ static int ipv4_local_port_range(ctl_table *table, int write,
 	return ret;
 }
 
+
+void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
+{
+	gid_t *data = net->ipv4.sysctl_ping_group_range;
+	unsigned seq;
+	do {
+		seq = read_seqbegin(&sysctl_local_ports.lock);
+
+		*low = data[0];
+		*high = data[1];
+	} while (read_seqretry(&sysctl_local_ports.lock, seq));
+}
+
+void inet_get_ping_group_range_table(struct ctl_table *table, gid_t *low, gid_t *high)
+{
+	gid_t *data = table->data;
+	unsigned seq;
+	do {
+		seq = read_seqbegin(&sysctl_local_ports.lock);
+
+		*low = data[0];
+		*high = data[1];
+	} while (read_seqretry(&sysctl_local_ports.lock, seq));
+}
+
+/* Update system visible ping group range */
+static void set_ping_group_range(struct ctl_table *table, int range[2])
+{
+	gid_t *data = table->data;
+	write_seqlock(&sysctl_local_ports.lock);
+	data[0] = range[0];
+	data[1] = range[1];
+	write_sequnlock(&sysctl_local_ports.lock);
+}
+
+/* Validate changes from /proc interface. */
+static int ipv4_ping_group_range(ctl_table *table, int write,
+				 void __user *buffer,
+				 size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	gid_t range[2];
+	ctl_table tmp = {
+		.data = &range,
+		.maxlen = sizeof(range),
+		.mode = table->mode,
+		.extra1 = &ip_ping_group_range_min,
+		.extra2 = &ip_ping_group_range_max,
+	};
+
+	inet_get_ping_group_range_table(table, range, range + 1);
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0)
+		set_ping_group_range(table, range);
+
+	return ret;
+}
+
 static int proc_tcp_congestion_control(ctl_table *ctl, int write,
 				       void __user *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -680,6 +743,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "ping_group_range",
+		.data		= &init_net.ipv4.sysctl_ping_group_range,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_ping_group_range),
+		.mode		= 0644,
+		.proc_handler	= ipv4_ping_group_range,
+	},
 	{ }
 };
 
@@ -714,8 +784,18 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 			&net->ipv4.sysctl_icmp_ratemask;
 		table[6].data =
 			&net->ipv4.sysctl_rt_cache_rebuild_count;
+		table[7].data =
+			&net->ipv4.sysctl_ping_group_range;
+
 	}
 
+	/*
+	 * Sane defaults - nobody may create ping sockets.
+	 * Boot scripts should set this to distro-specific group.
+	 */
+	net->ipv4.sysctl_ping_group_range[0] = 1;
+	net->ipv4.sysctl_ping_group_range[1] = 0;
+
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
-- 
1.7.0.4
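
A note on the default set in ipv4_sysctl_init_net() above: with
sysctl_ping_group_range initialised to { 1, 0 } the range is empty, so no
group ID can fall inside it. Below is a minimal userspace sketch of that
idea. It assumes the permission check is an inclusive low <= gid <= high
comparison; the code that performs the check is not part of this hunk, and
gid_sketch_t and gid_in_ping_range() are hypothetical names used only for
illustration.

```c
#include <assert.h>   /* only needed for the usage checks mentioned below */
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's gid_t. */
typedef unsigned int gid_sketch_t;

/*
 * Inclusive range test: true only when low <= gid <= high.
 * Assumption: the real check compares a group ID against the two
 * values stored in sysctl_ping_group_range in this way.
 */
bool gid_in_ping_range(gid_sketch_t gid, gid_sketch_t low, gid_sketch_t high)
{
	return low <= gid && gid <= high;
}
```

Under that assumption, gid_in_ping_range(g, 1, 0) is false for every g,
which matches the "nobody may create ping sockets" comment, while a boot
script writing e.g. "100 100" would admit exactly one group.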

^ permalink raw reply related

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 18:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101242420.2875@router.home>

Here is a simpler version. I think we can get away without disabling
interrupts: the value that we get from the read does not matter, since the
TID will not match.


Subject: slub: Make CONFIG_DEBUG_PAGEALLOC work with new fastpath

Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGEALLOC may
have marked as invalid to retrieve the pointer to the next free object.

Probe that address before dereferencing the pointer to the page.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-10 12:54:00.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-10 13:04:18.000000000 -0500
@@ -261,6 +261,18 @@ static inline void *get_freepointer(stru
 	return *(void **)(object + s->offset);
 }

+static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+{
+	void *p;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
+#else
+	p = get_freepointer(s, object);
+#endif
+	return p;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1943,7 +1955,7 @@ redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_safe(s, object), next_tid(tid)))) {

 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Eric Dumazet @ 2011-05-10 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101242420.2875@router.home>

On Tuesday, 10 May 2011 at 12:43 -0500, Christoph Lameter wrote:
> Draft for a patch
> 
> 
> Subject: slub: Make CONFIG_DEBUG_PAGEALLOC work with new fastpath
> 
> Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGEALLOC may
> have marked as invalid to retrieve the pointer to the next free object.
> 
> Probe that address before dereferencing the pointer to the page.
> All of that needs to occur with interrupts disabled since an interrupt
> could cause the page status to change (as pointed out by Eric).
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |   23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-05-10 12:35:30.000000000 -0500
> +++ linux-2.6/mm/slub.c	2011-05-10 12:38:53.000000000 -0500
> @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru
>  	return *(void **)(object + s->offset);
>  }
> 
> +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> +{
> +	void *p;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	if (probe_kernel_address(object))
> +		p = NULL;	/* Invalid */
> +	else
> +		p = get_freepointer(s, object);
> +
> +	local_irq_restore(flags);
> +#else
> +	p = get_freepointer(s, object);
> +#endif
> +	return p;
> +}
> +
>  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>  {
>  	*(void **)(object + s->offset) = fp;
> @@ -1933,7 +1954,7 @@ redo:
>  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
>  				s->cpu_slab->freelist, s->cpu_slab->tid,
>  				object, tid,
> -				get_freepointer(s, object), next_tid(tid)))) {
> +				get_freepointer_safe(s, object), next_tid(tid)))) {
> 
>  			note_cmpxchg_failure("slab_alloc", s, tid);
>  			goto redo;


Really, this won't work, Christoph.

You have to disable IRQs _before_ even fetching 'object'.

Otherwise, an IRQ can arrive, allocate this object, and pass it to another
CPU.

This other CPU can free the object and unmap the page right after you did
the probe_kernel_address(object) (successfully), and before your CPU does:

p = get_freepointer(s, object); << BUG >>

I really don't understand your motivation to keep the buggy commit.




^ permalink raw reply

* Re: Bug#625914: linux-image-2.6.38-2-amd64: bridging is not interacting well with multicast in 2.6.38-4
From: Noah Meyerhans @ 2011-05-10 18:05 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: 625914, netdev, bridge
In-Reply-To: <1305031369.4065.259.camel@localhost>

On Tue, May 10, 2011 at 01:42:49PM +0100, Ben Hutchings wrote:
> > > This is pretty weird.  Debian version 2.6.38-3 has a few bridging
> > > changes from stable 2.6.38.3 and 2.6.38.4, but they don't look like they
> > > would cause this.
> > 
> > I have apparently filed the bug against the wrong version of Debian's
> > kernel.  2.6.38-3 is not affected, and works as expected.  The change
> > was introduced in -4.  That may have been clear from the report itself,
> > but the report was filed against -3.  I've fixed that in the BTS.
> 
> I gathered that, and then made the same mistake in writing the above!
> The version with the regression, 2.6.38-4, includes the changes from
> stable 2.6.38.3 and 2.6.38.4

With a little help from git bisect, I've tracked this regression down to
the following commit to the stable-2.6.38.y tree:

commit 5f1c356a3fadc0c19922d660da723b79bcc9aad7
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Fri Mar 18 05:27:28 2011 +0000

    bridge: Reset IPCB when entering IP stack on NF_FORWARD
    
    [ Upstream commit 6b1e960fdbd75dcd9bcc3ba5ff8898ff1ad30b6e ]
    
    Whenever we enter the IP stack proper from bridge netfilter we
    need to ensure that the skb is in a form the IP stack expects
    it to be in.
    
    The entry point on NF_FORWARD did not meet the requirements of
    the IP stack, therefore leading to potential crashes/panics.
    
    This patch fixes the problem.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Acked-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

The diff is
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 4b5b66d..49d50ea 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -741,6 +741,9 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb,
                nf_bridge->mask |= BRNF_PKT_TYPE;
        }
 
+       if (br_parse_ip_options(skb))
+               return NF_DROP;
+
        /* The physdev module checks on this */
        nf_bridge->mask |= BRNF_BRIDGED;
        nf_bridge->physoutdev = skb->dev;

If I revert this change, network connectivity functions as expected for
the VMs on this host.

I don't know enough about this change or the problem it was supposed to
solve to be able to guess about what's going wrong.

noah


^ permalink raw reply related

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 17:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101226460.2875@router.home>


Draft for a patch


Subject: slub: Make CONFIG_DEBUG_PAGEALLOC work with new fastpath

Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGEALLOC may
have marked as invalid to retrieve the pointer to the next free object.

Probe that address before dereferencing the pointer to the page.
All of that needs to occur with interrupts disabled since an interrupt
could cause the page status to change (as pointed out by Eric).

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-10 12:35:30.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-10 12:38:53.000000000 -0500
@@ -261,6 +261,27 @@ static inline void *get_freepointer(stru
 	return *(void **)(object + s->offset);
 }

+static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+{
+	void *p;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (probe_kernel_address(object))
+		p = NULL;	/* Invalid */
+	else
+		p = get_freepointer(s, object);
+
+	local_irq_restore(flags);
+#else
+	p = get_freepointer(s, object);
+#endif
+	return p;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1933,7 +1954,7 @@ redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_safe(s, object), next_tid(tid)))) {

 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 17:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <1305047682.2758.1.camel@edumazet-laptop>

On Tue, 10 May 2011, Eric Dumazet wrote:

> On Tuesday, 10 May 2011 at 11:39 -0500, Christoph Lameter wrote:
>
> > #ifdef CONFIG_DEBUG_PAGE_ALLOC
> > 	if (illegal_page_alloc-address(object))
> > 		goto redo;
> > #endif
> >
> > before the cmpxchg should do the trick.
> >
>
> Again, it won't work...
>
> You can have an IRQ right after the check and before the cmpxchg.

OK, I guess we then also need to disable IRQs if CONFIG_DEBUG_PAGEALLOC is set?

The cmpxchg is not the problem. The problem is the following expression
which retrieves the pointer to the next available object from the object
on the page:

get_freepointer(s, object)

In the CONFIG_DEBUG_PAGEALLOC case we could disable interrupts, then do the
check, then fetch the pointer and then reenable interrupts.

All of this can occur before the cmpxchg.

> This interrupt can allocate this block of memory, free it, and unmap
> page from memory.
>
> cmpxchg() reads unmapped memory -> BUG

The cmpxchg is not accessing any memory on the page.

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Eric Dumazet @ 2011-05-10 17:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vegard Nossum, Pekka Enberg, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.DEB.2.00.1105101133440.2611@router.home>

On Tuesday, 10 May 2011 at 11:39 -0500, Christoph Lameter wrote:

> #ifdef CONFIG_DEBUG_PAGE_ALLOC
> 	if (illegal_page_alloc-address(object))
> 		goto redo;
> #endif
> 
> before the cmpxchg should do the trick.
> 

Again, it won't work...

You can have an IRQ right after the check and before the cmpxchg.

This interrupt can allocate this block of memory, free it, and unmap
page from memory.

cmpxchg() reads unmapped memory -> BUG




^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 16:39 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Pekka Enberg, Eric Dumazet, casteyde.christian, Andrew Morton,
	netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <a49ddd5511b74b8d9b81af8c3ef72d5a@ulrik.uio.no>

On Tue, 10 May 2011, Vegard Nossum wrote:

> Presumably the problem is that the page can get freed, and that with
> DEBUG_PAGEALLOC, the page will therefore not be present and subsequently
> trigger a page fault when doing this cmpxchg() on the possibly freed object.

The problem is not the cmpxchg. The cmpxchg is occurring on the per cpu
structure for the slab and that remains even if the page is freed.

The problem is the speculative fetch of the address of the following
object from a pointer into the page. The cmpxchg will fail in that case
because the TID was incremented and the result of the address fetch will
be discarded.
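
To illustrate the argument, here is a minimal userspace sketch of the
TID-guarded commit. It is not the SLUB code: it packs a freelist head and a
TID into one 64-bit word and uses a C11 compare-and-swap in place of the
per-cpu irqsafe_cpu_cmpxchg_double(), and all names (cpu_state, next_free,
sketch_commit, sketch_alloc, ...) are hypothetical. What it shows is that a
speculatively read "next" value is thrown away whenever the TID has moved on
since the snapshot was taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NOBJ 4
#define NONE UINT32_MAX

/* Stand-in for the free pointer stored inside each object on the page. */
static uint32_t next_free[NOBJ];
/* Stand-in for the per-cpu structure: low 32 bits head, high 32 bits tid. */
static _Atomic uint64_t cpu_state;

static uint64_t pack(uint32_t head, uint32_t tid)
{
	return ((uint64_t)tid << 32) | head;
}
static uint32_t head_of(uint64_t s) { return (uint32_t)s; }
static uint32_t tid_of(uint64_t s) { return (uint32_t)(s >> 32); }

void sketch_init(void)
{
	for (uint32_t i = 0; i < NOBJ; i++)
		next_free[i] = (i + 1 < NOBJ) ? i + 1 : NONE;
	atomic_store(&cpu_state, pack(0, 0));
}

uint64_t sketch_snapshot(void)
{
	return atomic_load(&cpu_state);
}

/*
 * Commit step: install (speculative_next, tid + 1) only if head and tid
 * are both unchanged since the snapshot. On failure nothing is written,
 * so a garbage speculative_next is simply discarded.
 */
bool sketch_commit(uint64_t snapshot, uint32_t speculative_next)
{
	uint64_t desired = pack(speculative_next, tid_of(snapshot) + 1);

	return atomic_compare_exchange_strong(&cpu_state, &snapshot, desired);
}

/* Returns the index of the allocated object, or NONE if the list is empty. */
uint32_t sketch_alloc(void)
{
	for (;;) {
		uint64_t snap = sketch_snapshot();
		uint32_t object = head_of(snap);

		if (object == NONE)
			return NONE;
		assert(object < NOBJ);
		/* Speculative read; only trusted if the commit succeeds. */
		if (sketch_commit(snap, next_free[object]))
			return object;
	}
}
```

Taking a snapshot, letting another sketch_alloc() run in between (as an
interrupt would), and then calling sketch_commit() with the stale snapshot
fails and leaves cpu_state untouched, whatever value was passed as the
speculative next pointer. The sketch says nothing about whether the
speculative read itself may fault, which is the DEBUG_PAGEALLOC question.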

> Regardless of DEBUG_PAGEALLOC or kmemcheck, what happens if the page gets
> freed, then allocated again for a completely different purpose in another part
> of the kernel, and new user of the page by chance writes the same "tid" number
> that the cmpxchg() is expecting?

The tid is not stored in the page struct but in a per cpu structure.

Explicitly checking whether this is an illegal address in the DEBUG_PAGEALLOC
case, and redoing the loop if so, will address the issue.

So

#ifdef CONFIG_DEBUG_PAGE_ALLOC
	if (illegal_page_alloc-address(object))
		goto redo;
#endif

before the cmpxchg should do the trick.

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Christoph Lameter @ 2011-05-10 16:33 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Eric Dumazet, casteyde.christian, Andrew Morton, netdev,
	bugzilla-daemon, bugme-daemon, Vegard Nossum
In-Reply-To: <4DC91137.4030109@cs.helsinki.fi>

On Tue, 10 May 2011, Pekka Enberg wrote:

> On 5/10/11 1:17 PM, Eric Dumazet wrote:
> > On Tuesday, 10 May 2011 at 13:03 +0300, Pekka Enberg wrote:
> >
> > > Can't we fix the issue by putting kmemcheck_mark_initialized() to
> > > set_freepointer()?
> >
> > This would solve kmemcheck problem, not DEBUG_PAGEALLOC
>
> Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.

Well, back to the #ifdef then? Or does DEBUG_PAGEALLOC have some override
mechanism that would let us do a speculative memory access?

^ permalink raw reply
