Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
       [not found]         ` <4807377b0810031628x43f79eferdbb9c9c264a5816e@mail.gmail.com>
@ 2008-10-03 23:30           ` Jesse Brandeburg
       [not found]           ` <alpine.LNX.1.10.0810041219130.26779@jikos.suse.cz>
  1 sibling, 0 replies; 8+ messages in thread
From: Jesse Brandeburg @ 2008-10-03 23:30 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Jesse Barnes, David Miller, jesse.brandeburg, linux-kernel,
	netdev, kkeil, agospoda, arjan, david.graham, bruce.w.allan,
	john.ronciak, Thomas Gleixner, chris.jones, tim.gardner, airlied,
	Olaf Kirch, Linus Torvalds

On Fri, Oct 3, 2008 at 4:28 PM, Jesse Brandeburg
<jesse.brandeburg@gmail.com> wrote:
> On Fri, Oct 3, 2008 at 2:45 PM, Jiri Kosina <jkosina@suse.cz> wrote:
>> Karsten has been testing kernel with these three patches from the series
>> applied:
>>
>>        e1000e: reset swflag after resetting hardware
>>        e1000e: fix lockdep issues
>>        e1000e: debug contention on NVM SWFLAG
>>
>> This was done on a hardware which previously triggered the bug in just a
>> few test iterations in quite a reliable way. Now, with these patches
>> applied, the EEPROM corruption didn't happen after several tens of
>> iterations.
>>
>> Please note, that the patch that disables the writes to EEPROM on hardware
>> level was *not* involved in this testing.
>>
>> Therefore it currently seems that these three patches really address the
>> race condition issue that was present in the e1000e driver.
>
> Our experience is different.  We are also testing with the "protection
> patch" reverted.
>
> We see that the problem specifically comes and goes when
> removing/adding the use of set_memory_ro/set_memory_rw to the driver.
>
> I'm working to catch the bad element in the act with a hardware
> breakpoint or an ITP (we're trying both)
>
>> It is still not clear why the bug started triggering all of a sudden for
>> so many people though.
>
> we plan to keep on working on this until we understand what is going on.

I removed the bad addresses from the cc: list

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <alpine.LNX.1.10.0810041219130.26779@jikos.suse.cz>]

[parent not found: <alpine.LFD.2.00.0810041236250.4404@apollo>]

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
       [not found]             ` <alpine.LFD.2.00.0810041236250.4404@apollo>
@ 2008-10-05  1:24               ` Jesse Brandeburg
  2008-10-05  8:51                 ` Thomas Gleixner
  0 siblings, 1 reply; 8+ messages in thread
From: Jesse Brandeburg @ 2008-10-05  1:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jiri Kosina, Jesse Barnes, David Miller, jesse.brandeburg,
	linux-kernel, netdev, kkeil, agospoda, arjan, david.graham,
	bruce.w.allan, john.ronciak, chris.jones, tim.gardner, airlied,
	Olaf Kirch, Linus Torvalds

On Sat, Oct 4, 2008 at 4:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 4 Oct 2008, Jiri Kosina wrote:
>> On Fri, 3 Oct 2008, Jesse Brandeburg wrote:
>> > Our experience is different.  We are also testing with the "protection
>> > patch" reverted.
>> > We see that the problem specifically comes and goes when
>> > removing/adding the use of set_memory_ro/set_memory_rw to the driver.
>>
>> But if this patch (which is an obvious workaround, compared to the other
>> patches which fix real bugs, right?) would be catching some malicious
>> accessess to the mapped EEPROM, there should be stacktraces present in the
>> kernel log, right?

yes, but I think it is just changing timing, i don't see any backtraces either.

> Exactly. The access to a ro region results in a fault. I have nowhere
> seen that trigger, but I can reproduce the trylock() WARN_ON, which
> confirms that there is concurrent access to the NVRAM registers. The
> backtrace pattern is similar to the one you have seen.

are you still getting WARN_ON *with* all the mutex based fixes already applied?

with the mutex patches in place (without protection patch) we are
still reproducing the issue, until we apply the set_memory_ro patch.
I had no luck on friday setting a hardware breakpoint on memory access
with kgdb to catch the writer with a breakpoint.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05  1:24               ` Jesse Brandeburg
@ 2008-10-05  8:51                 ` Thomas Gleixner
  2008-10-05 15:05                   ` Arjan van de Ven
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2008-10-05  8:51 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Jiri Kosina, Jesse Barnes, David Miller, jesse.brandeburg,
	linux-kernel, netdev, kkeil, agospoda, arjan, david.graham,
	bruce.w.allan, john.ronciak, chris.jones, tim.gardner, airlied,
	Olaf Kirch, Linus Torvalds

On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
> > Exactly. The access to a ro region results in a fault. I have nowhere
> > seen that trigger, but I can reproduce the trylock() WARN_ON, which
> > confirms that there is concurrent access to the NVRAM registers. The
> > backtrace pattern is similar to the one you have seen.
> 
> are you still getting WARN_ON *with* all the mutex based fixes already applied?

The WARN_ON triggers with current mainline. Is there any fixlet in
Linus tree missing ?

> with the mutex patches in place (without protection patch) we are
> still reproducing the issue, until we apply the set_memory_ro patch.

That does not make sense to me. If the memory_ro patch is providing
_real_ protection then you _must_ run into an access violation. If not,
then the patch just papers over the real problem in some mysterious
way.

The patch does:

+	set_memory_rw((unsigned long)hw->flash_address,
+	              hw->flash_len >> PAGE_SHIFT);
 	writew(val, hw->flash_address + reg);
+	set_memory_ro((unsigned long)hw->flash_address,
+	              hw->flash_len >> PAGE_SHIFT);

This changes massively the timing of the flash access. Could this be
the problem on the machine which needs the set_memory_ro patch to
survive ?

> I had no luck on friday setting a hardware breakpoint on memory access
> with kgdb to catch the writer with a breakpoint.

Well, why should you get a hardware breakpoint when the _ro protection
does not trigger in the first place ?

Granted there could be a _rw alias mapping, but then the problem must
be still visible with the _ro patch applied.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05  8:51                 ` Thomas Gleixner
@ 2008-10-05 15:05                   ` Arjan van de Ven
  2008-10-05 15:55                     ` Thomas Gleixner
  0 siblings, 1 reply; 8+ messages in thread
From: Arjan van de Ven @ 2008-10-05 15:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesse Brandeburg, Jiri Kosina, Jesse Barnes, David Miller,
	jesse.brandeburg, linux-kernel, netdev, kkeil, agospoda,
	david.graham, bruce.w.allan, john.ronciak, chris.jones,
	tim.gardner, airlied, Olaf Kirch, Linus Torvalds

Thomas Gleixner wrote:
> On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
>>> Exactly. The access to a ro region results in a fault. I have nowhere
>>> seen that trigger, but I can reproduce the trylock() WARN_ON, which
>>> confirms that there is concurrent access to the NVRAM registers. The
>>> backtrace pattern is similar to the one you have seen.
>> are you still getting WARN_ON *with* all the mutex based fixes already applied?
> 
> The WARN_ON triggers with current mainline. Is there any fixlet in
> Linus tree missing ?
> 
>> with the mutex patches in place (without protection patch) we are
>> still reproducing the issue, until we apply the set_memory_ro patch.
> 
> That does not make sense to me. If the memory_ro patch is providing
> _real_ protection then you _must_ run into an access violation. If not,
> then the patch just papers over the real problem in some mysterious
> way.
> 

not if the bad code is doing copy_to_user .... (or similar)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05 15:05                   ` Arjan van de Ven
@ 2008-10-05 15:55                     ` Thomas Gleixner
  2008-10-05 16:02                       ` Arjan van de Ven
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2008-10-05 15:55 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jesse Brandeburg, Jiri Kosina, Jesse Barnes, David Miller,
	jesse.brandeburg, linux-kernel, netdev, kkeil, agospoda,
	david.graham, bruce.w.allan, john.ronciak, chris.jones,
	tim.gardner, airlied, Olaf Kirch, Linus Torvalds

On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> Thomas Gleixner wrote:
> > On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
> > > > Exactly. The access to a ro region results in a fault. I have nowhere
> > > > seen that trigger, but I can reproduce the trylock() WARN_ON, which
> > > > confirms that there is concurrent access to the NVRAM registers. The
> > > > backtrace pattern is similar to the one you have seen.
> > > are you still getting WARN_ON *with* all the mutex based fixes already
> > > applied?
> > 
> > The WARN_ON triggers with current mainline. Is there any fixlet in
> > Linus tree missing ?
> > 
> > > with the mutex patches in place (without protection patch) we are
> > > still reproducing the issue, until we apply the set_memory_ro patch.
> > 
> > That does not make sense to me. If the memory_ro patch is providing
> > _real_ protection then you _must_ run into an access violation. If not,
> > then the patch just papers over the real problem in some mysterious
> > way.
> > 
> 
> not if the bad code is doing copy_to_user .... (or similar)

You mean: copy_from_user :) This would require that the e1000e
nvram region is writable via copy_from_user by an e1000e user space
interface. A quick grep does not reviel such a horrible interface.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05 15:55                     ` Thomas Gleixner
@ 2008-10-05 16:02                       ` Arjan van de Ven
  2008-10-05 16:16                         ` Thomas Gleixner
  0 siblings, 1 reply; 8+ messages in thread
From: Arjan van de Ven @ 2008-10-05 16:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesse Brandeburg, Jiri Kosina, Jesse Barnes, David Miller,
	jesse.brandeburg, linux-kernel, netdev, kkeil, agospoda,
	david.graham, bruce.w.allan, john.ronciak, chris.jones,
	tim.gardner, airlied, Olaf Kirch, Linus Torvalds

On Sun, 5 Oct 2008 17:55:14 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > Thomas Gleixner wrote:
> > > On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
> > > > > Exactly. The access to a ro region results in a fault. I have
> > > > > nowhere seen that trigger, but I can reproduce the trylock()
> > > > > WARN_ON, which confirms that there is concurrent access to
> > > > > the NVRAM registers. The backtrace pattern is similar to the
> > > > > one you have seen.
> > > > are you still getting WARN_ON *with* all the mutex based fixes
> > > > already applied?
> > > 
> > > The WARN_ON triggers with current mainline. Is there any fixlet in
> > > Linus tree missing ?
> > > 
> > > > with the mutex patches in place (without protection patch) we
> > > > are still reproducing the issue, until we apply the
> > > > set_memory_ro patch.
> > > 
> > > That does not make sense to me. If the memory_ro patch is
> > > providing _real_ protection then you _must_ run into an access
> > > violation. If not, then the patch just papers over the real
> > > problem in some mysterious way.
> > > 
> > 
> > not if the bad code is doing copy_to_user .... (or similar)
> 
> You mean: copy_from_user :) This would require that the e1000e
> nvram region is writable via copy_from_user by an e1000e user space
> interface. A quick grep does not reviel such a horrible interface.

I meant a "copy_to_user" to a duff pointer, somewhere in the kernel.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05 16:02                       ` Arjan van de Ven
@ 2008-10-05 16:16                         ` Thomas Gleixner
  2008-10-05 17:01                           ` Arjan van de Ven
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2008-10-05 16:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jesse Brandeburg, Jiri Kosina, Jesse Barnes, David Miller,
	jesse.brandeburg, linux-kernel, netdev, kkeil, agospoda,
	david.graham, bruce.w.allan, john.ronciak, chris.jones,
	tim.gardner, airlied, Olaf Kirch, Linus Torvalds

On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> On Sun, 5 Oct 2008 17:55:14 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > > Thomas Gleixner wrote:
> > > > On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
> > > > > > Exactly. The access to a ro region results in a fault. I have
> > > > > > nowhere seen that trigger, but I can reproduce the trylock()
> > > > > > WARN_ON, which confirms that there is concurrent access to
> > > > > > the NVRAM registers. The backtrace pattern is similar to the
> > > > > > one you have seen.
> > > > > are you still getting WARN_ON *with* all the mutex based fixes
> > > > > already applied?
> > > > 
> > > > The WARN_ON triggers with current mainline. Is there any fixlet in
> > > > Linus tree missing ?
> > > > 
> > > > > with the mutex patches in place (without protection patch) we
> > > > > are still reproducing the issue, until we apply the
> > > > > set_memory_ro patch.
> > > > 
> > > > That does not make sense to me. If the memory_ro patch is
> > > > providing _real_ protection then you _must_ run into an access
> > > > violation. If not, then the patch just papers over the real
> > > > problem in some mysterious way.
> > > > 
> > > 
> > > not if the bad code is doing copy_to_user .... (or similar)
> > 
> > You mean: copy_from_user :) This would require that the e1000e
> > nvram region is writable via copy_from_user by an e1000e user space
> > interface. A quick grep does not reviel such a horrible interface.
> 
> I meant a "copy_to_user" to a duff pointer, somewhere in the kernel.

Hmm, don't we check the *to address on copy_to_user ?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote:
  2008-10-05 16:16                         ` Thomas Gleixner
@ 2008-10-05 17:01                           ` Arjan van de Ven
  0 siblings, 0 replies; 8+ messages in thread
From: Arjan van de Ven @ 2008-10-05 17:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesse Brandeburg, Jiri Kosina, Jesse Barnes, David Miller,
	jesse.brandeburg, linux-kernel, netdev, kkeil, agospoda,
	david.graham, bruce.w.allan, john.ronciak, chris.jones,
	tim.gardner, airlied, Olaf Kirch, Linus Torvalds

On Sun, 5 Oct 2008 18:16:29 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > On Sun, 5 Oct 2008 17:55:14 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > > > Thomas Gleixner wrote:
> > > > > On Sat, 4 Oct 2008, Jesse Brandeburg wrote:
> > > > > > > Exactly. The access to a ro region results in a fault. I
> > > > > > > have nowhere seen that trigger, but I can reproduce the
> > > > > > > trylock() WARN_ON, which confirms that there is
> > > > > > > concurrent access to the NVRAM registers. The backtrace
> > > > > > > pattern is similar to the one you have seen.
> > > > > > are you still getting WARN_ON *with* all the mutex based
> > > > > > fixes already applied?
> > > > > 
> > > > > The WARN_ON triggers with current mainline. Is there any
> > > > > fixlet in Linus tree missing ?
> > > > > 
> > > > > > with the mutex patches in place (without protection patch)
> > > > > > we are still reproducing the issue, until we apply the
> > > > > > set_memory_ro patch.
> > > > > 
> > > > > That does not make sense to me. If the memory_ro patch is
> > > > > providing _real_ protection then you _must_ run into an access
> > > > > violation. If not, then the patch just papers over the real
> > > > > problem in some mysterious way.
> > > > > 
> > > > 
> > > > not if the bad code is doing copy_to_user .... (or similar)
> > > 
> > > You mean: copy_from_user :) This would require that the e1000e
> > > nvram region is writable via copy_from_user by an e1000e user
> > > space interface. A quick grep does not reviel such a horrible
> > > interface.
> > 
> > I meant a "copy_to_user" to a duff pointer, somewhere in the kernel.
> 
> Hmm, don't we check the *to address on copy_to_user ?
> 

fair point

and we do exception catching for copy_from_user as well on the source,
just by how it's implemented

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-10-05 17:01 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20080930030825.22950.18891.stgit@jbrandeb-bw.jf.intel.com>
     [not found] ` <200810021523.45884.jbarnes@virtuousgeek.org>
     [not found]   ` <20081003.134634.240211201.davem@davemloft.net>
     [not found]     ` <200810031429.22598.jbarnes@virtuousgeek.org>
     [not found]       ` <alpine.LNX.1.10.0810032338140.26779@jikos.suse.cz>
     [not found]         ` <4807377b0810031628x43f79eferdbb9c9c264a5816e@mail.gmail.com>
2008-10-03 23:30           ` [RFC PATCH 02/12] On Tue, 23 Sep 2008, David Miller wrote: Jesse Brandeburg
     [not found]           ` <alpine.LNX.1.10.0810041219130.26779@jikos.suse.cz>
     [not found]             ` <alpine.LFD.2.00.0810041236250.4404@apollo>
2008-10-05  1:24               ` Jesse Brandeburg
2008-10-05  8:51                 ` Thomas Gleixner
2008-10-05 15:05                   ` Arjan van de Ven
2008-10-05 15:55                     ` Thomas Gleixner
2008-10-05 16:02                       ` Arjan van de Ven
2008-10-05 16:16                         ` Thomas Gleixner
2008-10-05 17:01                           ` Arjan van de Ven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).