public inbox for linux-kernel@vger.kernel.org
* 1352 NUL bytes at the end of a page?
@ 2004-05-13 19:08 Andy Isaacson
  2004-05-14  2:22 ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Isaacson @ 2004-05-13 19:08 UTC (permalink / raw)
  To: linux-kernel

We've got a user who's reporting BK problems which we've traced down to
the fact that his s.ChangeSet file has a hole, filled with '\0' bytes,
that's so far always 1352 bytes long, and the end is page-aligned.  (In
fact, the two cases we've seen so far have been 8k-aligned.)  The
correct file data picks up again after the hole.

bk is writing the data using stdio in a child process (fork, exec,
wait), then mmaping the result.  The corruption is persistent; he sent
us the s.ChangeSet file and there it was (not a cache or buffering
problem, therefore).
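
For illustration only, the write-then-map pattern described above can be
sketched in C roughly as follows. This is not BK's code: the function name
write_then_mmap_count_nuls() is hypothetical, and the child here writes the
file directly instead of exec'ing a separate program. The sketch assumes
that waiting for the child is the only synchronization between the stdio
writer and the mmap reader. On a healthy kernel the count of NUL bytes
should be zero.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a child writes `len` bytes of 'x' to `path` using
 * stdio, the parent waits for it, then maps the file and counts NUL bytes.
 * Returns the number of NUL bytes seen, or -1 on any error. */
long write_then_mmap_count_nuls(const char *path, size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: buffered stdio write; fclose() flushes the buffer. */
        FILE *f = fopen(path, "w");
        if (!f)
            _exit(1);
        for (size_t i = 0; i < len; i++)
            if (fputc('x', f) == EOF)
                _exit(1);
        _exit(fclose(f) == 0 ? 0 : 1);
    }

    /* Parent: wait for the writer before touching the file. */
    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || (size_t)st.st_size != len) {
        close(fd);
        return -1;
    }
    if (len == 0) {            /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    long nuls = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == 0)
            nuls++;

    munmap(p, len);
    return nuls;
}
```

Calling it with a scratch path and a length spanning several pages, for
example write_then_mmap_count_nuls("/tmp/scratch.dat", 20000), exercises the
same write-then-map sequence. A nonzero result would correspond to the kind
of hole reported here.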

2.6.6-bk (current head of tree from whenever he built), UP PIII, symptom
observed on both ext3 and reiserfs.  (However, we've explicitly verified
the hole only on reiser.)

The problem is intermittent, having happened "several" times over the
last few months, and doesn't appear to be tied to any particular kernel
version.

To me, this looks awfully close to an Ethernet frame size... but that's
just a wild guess.  And I don't think he's running Ethernet (still
waiting for dmesg and .config).

I've asked for more info, memtest86, and will attempt to reproduce it on
another box.

Does anyone have insight into this peculiar problem?

-andy

* Re: 1352 NUL bytes at the end of a page?
@ 2004-05-14  4:32 Steven Cole
  0 siblings, 0 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-14  4:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andy Isaacson, Linux Kernel

Andrew Morton wrote:
>Andy Isaacson <adi@bitmover.com> wrote:
>>
>>  We've got a user who's reporting BK problems which we've traced down to
>>  the fact that his s.ChangeSet file has a hole, filled with '\0' bytes,
>>  that's so far always 1352 bytes long, and the end is page-aligned.  (In
>>  fact, the two cases we've seen so far have been 8k-aligned.)  The
>>  correct file data picks up again after the hole.
>
>When the reporter has a PIII machine it's often useful to find out the clock
>frequency - the lower it is, the older it is and the more likely it is that
>some component has rotted.
>
>If this one cannot be reproduced on any other machine I'd say it's a
>hardware failure.

Hi Andrew,

The user is me.  The machine is a 450 MHz P-III, about five years old now.
Andy mentioned ethernet, but I don't have that here, just 56k dialup.  The
extra information he requested was sent a couple of hours ago, and in the
meantime I ran two full passes of memtest86 3.1 with zero errors.

<slight detour>
I do occasionally have problems with pppd, and the following message always
appears in /var/log/messages:

May 13 18:09:30 spc kernel: serial8250: too much work for irq10
May 13 18:09:30 spc kernel: serial8250: too much work for irq10

The message is always doubled as above.  This has never yet occurred
at the same time as the bk failure, so the two seem unrelated.  I have
to kill -9 the pppd process and reconnect when the above happens.
This problem never happened with a 2.4.x kernel, and was first detected
during the middle of 2.5.x development.
</slight detour>

The only reason the above might be relevant to the bk failures is that
I've only noticed the failures when pulling over the net via ppp.
I've never gotten the failure when pulling from another repository
on the same disk (I've only got one).

If you have any ideas about narrowing down the potentially rotted
component, please let me know.

I cut and pasted the above from a lkml archive, so sorry if this
messes up your mail thread.  I'm not on lkml here at home, so
please cc me on any replies.

Steven

[parent not found: <200405131723.15752.elenstev@mesatop.com>]
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
@ 2004-05-17  3:36 Steven Cole
  2004-05-17  5:17 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-17  3:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Larry McVoy, Andrew Morton, adi, scole, support, linux-kernel

On Sunday 16 May 2004 08:42 pm, Linus Torvalds wrote:
> 
> On Sun, 16 May 2004, Larry McVoy wrote:
> > 
> > Be aware that how BK does I/O is with write() on the way out but with 
> > mmap on the way in.  The process which forked renumber has just written
> > the file and the renumber process is reading it with mmap.
> > 
> > If there are still any problems with mixing read/write and mmap then that
> > may be a problem but I would have expected to see things start going 
> > wrong on a page boundary and the one core dump I saw was page aligned 
> > at the tail but not at the head, it started in the middle of the page.
> 
> The kernel should have no problems with mixed read/write and mmap usage, 
> although user space obviously needs to synchronize the accesses on its own 
> some way. There is no implicit synchronization otherwise, and the mmap 
> user can see a partial write at any stage of the write.
> 
> Some architectures may have cache coherency issues that make this
> "interesting", but that's not the case on x86 (or indeed anything else
> remotely sane - virtual caches are just stupid in this day and age).
> 
> > I've told my team to drop this unless someone can show that it happens
> > on other kernels, this smells like a kernel bug to me, if it were a BK
> > bug we should have been getting hundreds of complaints by now.  We can
> > jump back on it if need be, let us know if you think it is a BK problem
> > after all.
> 
> Yeah, I agree. The only other possibility I see is that BK just doesn't
> synchronize, and expects writes to be atomically visible to other
> processes. They aren't. Preemption might just make this a whole lot more 
> visible, but on the other hand, so should SMP, so this sounds unlikely.
> 
> 		Linus
> 
> 
Larry, Linus,

I beat on this for the last day without PREEMPT and saw no failures at all.
Several kernels, rock solid all.
Rebooted with a current (as of a couple of hours ago) kernel and PREEMPT=y,
and after about the third pull into a repository (I have several), splaaat!

Here are the symptoms.  Same message from bk as usual:
---------------------------------------------------------------------------
takepatch: saved entire patch in PENDING/2004-05-16.01
---------------------------------------------------------------------------
Applying  15 revisions to ChangeSet renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet".
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
10586 bytes uncompressed to 52074, 4.92X expansion

One of Larry's guys, Rick Smith, sent me a little program (the source
is earlier in this thread) to check for null.  I called its executable
saga (see subject line).

[steven@spc SCCS]$ saga <s.ChangeSet
Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611

That was in the testing-2.6/RESYNC/SCCS directory of course.
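
For readers without that program, a checker of this kind might look roughly
like the sketch below. It is not Rick Smith's source: the names
find_nul_runs(), ends_on_page_boundary() and print_nul_run() are
hypothetical, the 4 KiB page size is an assumption that holds on x86, and
the output format only loosely mirrors the lines above. Here "end" is the
offset one past the last NUL byte, so len = end - start.

```c
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KiB pages, as on x86 */

struct nul_run {
    size_t start;   /* offset of the first NUL byte in the run */
    size_t end;     /* offset one past the last NUL byte */
    size_t line;    /* 1-based line number on which the run starts */
};

/* Scan buf for runs of at least min_len consecutive NUL bytes.
 * Stores up to max_runs results in runs[] and returns the total found. */
size_t find_nul_runs(const unsigned char *buf, size_t len, size_t min_len,
                     struct nul_run *runs, size_t max_runs)
{
    size_t found = 0, line = 1, i = 0;

    while (i < len) {
        if (buf[i] != 0) {
            if (buf[i] == '\n')
                line++;
            i++;
            continue;
        }

        /* Start of a NUL run: advance to its end. */
        size_t start = i;
        while (i < len && buf[i] == 0)
            i++;

        if (i - start >= min_len) {
            if (found < max_runs) {
                runs[found].start = start;
                runs[found].end = i;
                runs[found].line = line;
            }
            found++;
        }
    }
    return found;
}

/* True if the run stops exactly at a page boundary. */
int ends_on_page_boundary(const struct nul_run *r)
{
    return r->end % PAGE_SIZE_ASSUMED == 0;
}

/* Print one run in a format similar to the report above. */
void print_nul_run(const struct nul_run *r)
{
    printf("Found null start 0x%zx end 0x%zx len 0x%zx line %zu%s\n",
           r->start, r->end, r->end - r->start, r->line,
           ends_on_page_boundary(r) ? " (page-aligned end)" : "");
}
```

Applied to the three runs reported above, the page-boundary test would be
true in each case, since 0x1551000, 0x2031000 and 0x2331000 are all
multiples of 0x1000.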

OK, no more CONFIG_PREEMPT for me.  And, ppp failed earlier with:
serial8250: too much work for irq10.  That did not happen without
CONFIG_PREEMPT.

I reconnected to my ISP, bk pulled into my main testing repository, 
and that's when I got the splaaat.

Steven

* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
@ 2004-05-18 14:38 Linus Torvalds
  2004-05-19 10:53 ` Steven Cole
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2004-05-18 14:38 UTC (permalink / raw)
  To: Steven Cole
  Cc: Andrew Morton, Larry McVoy, mason, wli, hugh, adi, support,
	linux-kernel



On Mon, 17 May 2004, Steven Cole wrote:
>
> No problems, and with PREEMPT of course.

Ok. Good. It's a small data-set, but the bug made sense, and so did the 
fix.

> > If you see a failure on ext3, please try to analyze the corruption pattern 
> > again. It might be something different.
> 
> So, I take it that I should revert that one-liner if I want to get any failure data?
> With it, ext3 was pretty solid for this testing.

Yes. That one-liner is bogus. It was a good way to test a hypothesis for
the common case of a filesystem that uses the block_write_full_page thing
(and reiser is one of the few that doesn't), but it wasn't the real fix.
The reiser patch was the real fix for the problem on reiser, but ext3
should have been ok already. It uses (through a lot of other functions)
generic_file_aio_write_nolock() as the real write engine, and that one
calls "commit_write()" with the page lock held.

		Linus


end of thread, other threads:[~2004-05-19 14:46 UTC | newest]

Thread overview: 27+ messages
2004-05-13 19:08 1352 NUL bytes at the end of a page? Andy Isaacson
2004-05-14  2:22 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2004-05-14  4:32 Steven Cole
     [not found] <200405131723.15752.elenstev@mesatop.com>
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
2004-05-15  0:54   ` Steven Cole
2004-05-15  1:55     ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-17  3:36 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Steven Cole
2004-05-17  5:17 ` Linus Torvalds
2004-05-17  6:11   ` Andrew Morton
2004-05-17 13:56     ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-17 15:17       ` Theodore Ts'o
2004-05-17 15:20         ` Larry McVoy
2004-05-17 15:22         ` Linus Torvalds
2004-05-17 15:25           ` Larry McVoy
2004-05-17 15:37           ` viro
2004-05-17 17:30             ` Steven Cole
2004-05-17 17:40               ` viro
2004-05-17 17:39                 ` Steven Cole
2004-05-17 19:06                   ` viro
2004-05-17 15:40           ` Arjan van de Ven
2004-05-17 15:53             ` Steven Cole
2004-05-17 16:23         ` Davide Libenzi
2004-05-17 16:28           ` Davide Libenzi
2004-05-18 14:38 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Linus Torvalds
2004-05-19 10:53 ` Steven Cole
2004-05-19 12:10   ` Chris Mason
2004-05-19 12:20     ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-19 12:42       ` Nick Piggin
2004-05-19 13:28         ` Steven Cole
2004-05-19 13:36           ` Chris Mason
2004-05-19 13:59             ` Steven Cole
2004-05-19 14:03               ` Wayne Scott
2004-05-19 14:08               ` Chris Mason
2004-05-19 14:20                 ` Steven Cole
2004-05-19 14:45                 ` Steven Cole
