Data corruption problem

* Data corruption problem
@ 2011-02-11  5:14 Wayne Walker
       [not found] ` <20110211051458.GD27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Wayne Walker @ 2011-02-11  5:14 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

First, I'm not certain whether this is samba, the linux cifs driver, or
something else.

During testing, one of my QA guys was running an in-house program that
generates pseudo-random, but fully recreatable, data and writes it to
a file.  The file name is essentially the seed of the pseudo-random
stream, so, given a filename, the program can read the file back and
verify that the data is correct.
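
For readers who want to picture the seed-based approach described above, here is a
minimal sketch. It is not the in-house tool, whose source is not shown in
this thread. The class name SeededData and its methods are hypothetical, and
using java.util.Random as the generator is an assumption.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; the real test program is not shown in the thread.
final class SeededData {
    // Regenerates the byte stream for a seed. java.util.Random follows a
    // documented algorithm, so the same seed yields the same bytes each time.
    static byte[] generate(long seed, int length) {
        byte[] data = new byte[length];
        new Random(seed).nextBytes(data);
        return data;
    }

    // Returns the offset of the first byte that differs from the expected
    // stream for this seed, or -1 if every byte matches.
    static long firstMismatch(long seed, byte[] actual) {
        byte[] expected = generate(seed, actual.length);
        for (int i = 0; i < actual.length; i++) {
            if (expected[i] != actual[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long seed = 20110211L;                    // stands in for the file name
        byte[] data = generate(seed, 3 * 57344);
        System.out.println("clean: " + firstMismatch(seed, data));
        Arrays.fill(data, 57344, 2 * 57344, (byte) 0);   // simulate a lost 56K write
        System.out.println("after zeroing one block: " + firstMismatch(seed, data));
    }
}
```

Because the expected stream can be rebuilt from the seed alone, the verifier
needs no reference copy of the file, which is what makes a 9 GB test case
practical.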

The file he created was on a CentOS 5.5 machine that was mounting a cifs
share on another CentOS 5.5 host running samba.  After 150K individual
files from 35 bytes to 9 GB, he created a 9 GB file that failed
validation.  He ran the test again with the same seed and it succeeded.
He ran it a 3rd time and it failed again.

He got me involved.  I found no useful messages (cifs, IO, kernel mem,
network, or samba) in any logs on client or server anywhere near the
times of the file creations.

I cmp'd the files, then used "od -A x -t a" with offsets and diffed the
3 files.  Each of the 2 failed files has a single 56K (57344-byte) block
of NULs, at a different offset in each file.  Each 56K NUL block starts
at an offset x where x % 57344 == 0.

first file:
>>> 519995392 / 57344.
9068.0 # matching 56K blocks before the one null 56K block
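
The same check can be automated instead of diffing od output. The sketch
below walks a file in fixed-size blocks starting at offset 0 and reports
every full block that is entirely NUL, so each reported offset is a multiple
of the block size by construction. The class name NullBlockScan is
hypothetical, and the block size of 57344 is taken from the wsize default
quoted further down. A NUL run that is not block-aligned would not be
reported by this scan.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the tools used in the thread.
final class NullBlockScan {
    // Returns the starting offset of every full block that contains only NUL bytes.
    static List<Long> findNullBlocks(Path file, int blockSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        byte[] block = new byte[blockSize];
        long offset = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            while (true) {
                // read() may return fewer bytes than requested, so loop until
                // the block is full or the stream ends.
                int filled = 0;
                while (filled < blockSize) {
                    int n = in.read(block, filled, blockSize - filled);
                    if (n < 0) {
                        break;
                    }
                    filled += n;
                }
                if (filled < blockSize) {
                    break; // a trailing partial block is ignored
                }
                if (isAllZero(block)) {
                    offsets.add(offset);
                }
                offset += blockSize;
            }
        }
        return offsets;
    }

    static boolean isAllZero(byte[] block) {
        for (byte b : block) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        int wsize = 57344; // default cifs write size per fs/cifs/README
        for (long off : findNullBlocks(Paths.get(args[0]), wsize)) {
            System.out.println("NUL block at offset " + off + " (block " + (off / wsize) + ")");
        }
    }
}
```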

second file is certainly on a 1 K boundary, but I mislaid the diff data
for it and it's taking forever for cmp to run to find the offset and
verify that it's on a 56K boundary.  I'll follow up to this email
tomorrow with the result of the cmp.

So, I searched the kernel source, expecting to find 56K in the sata
driver code.  Instead the only place I found it that seemed relevant
was:

	./fs/cifs/README:  wsize default write size (default 57344)

I have since used cp to copy the file 4 times with tcpdump running at
both ends.  All 4 times have worked properly.  Don't know if that is
because tcpdump is slowing it down or if our test app could be at fault.
Our test app is talking to the local file system and not with a block
size of 56K, so I don't think it is our app.

Unfortunately, the tcpdumps at both ends report the kernel
dropping about 50% of the packets, so even if I can get it to fail,
I'm still unsure whether it's the client or the samba server, and a
client-side fault would still leave me choosing between our app and
fs/cifs.

The only other thing I can think of is the ethernet devices, but since
the packet is made up of 30+ ethernet frames, and being TCP there is
a payload checksum, I can't see the network layers being the culprit,
but just in case:

client w/ fs/cifs:
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

samba server:
01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller

A few questions:

0. Anyone already know of a bug in fs/cifs or samba that has this
symptom?

1. Anyone know how to get the kernel to not drop the packets?

2. Any other ideas on what I can do to gather more data to differentiate
between bad-app, fs/cifs, samba, or other-element-in-the-chain?

Thank you for all the work you guys do!

-- 

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


* Re: Data corruption problem
       [not found] ` <20110211051458.GD27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
@ 2011-02-11  5:21   ` Wayne Walker
  2011-02-11 11:53   ` Jeff Layton
  2011-02-18 18:30   ` Wayne Walker
  2 siblings, 0 replies; 9+ messages in thread
From: Wayne Walker @ 2011-02-11  5:21 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 10, 2011 at 11:14:59PM -0600, Wayne Walker wrote:
> first file:
> >>> 519995392 / 57344.
> 9068.0 # matching 56K blocks before the one null 56K block
> 
> second file is certainly on a 1 K boundary, but I mislaid the diff data
> for it and it's taking forever for cmp to run to find the offset and
> verify that it's on a 56K boundary.  I'll follow up to this email
> tomorrow with the result of the cmp.

>>> 7910088704/57344.
137941.0

the second file's error is also at a 56K boundary.

-- 

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


* Re: Data corruption problem
       [not found] ` <20110211051458.GD27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
  2011-02-11  5:21   ` Wayne Walker
@ 2011-02-11 11:53   ` Jeff Layton
       [not found]     ` <20110211065318.62f91a5b-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2011-02-18 18:30   ` Wayne Walker
  2 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2011-02-11 11:53 UTC (permalink / raw)
  To: Wayne Walker; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Thu, 10 Feb 2011 23:14:59 -0600
Wayne Walker <wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org> wrote:

> First, I'm not certain whether this is samba, the linux cifs driver, or
> something else.
> 
> During testing, one of my QA guys was running an inhouse program that
> generates pseudo-random, but fully recreatable, data and writes it to
> a file, the file is named with a name that is essentially the seed to
> the pseudo- random stream, so, given a filename, it can read the file
> and verify that the data is correct.
> 
> The file he created was on a CentOS 5.5 machine that was mounting a cifs
> share on another CentOS 5.5 host running samba.  After 150K individual
> files from 35 bytes to 9 GB, he created a 9 GB file that failed
> validation.  He ran the test again with the same seed and it succeeded.
> He ran it a 3rd time and it failed again.
> 
> He got me involved.  I found no useful messages (cifs, IO, kernel mem,
> network, or samba) in any logs on client or server anywhere near the
> times of the file creations.
> 
> I cmp'd the files.  Then used "od -A x -t a" with offsets and diffed the
> 3 files.  Each of the 2 failed files has a single block of 56K (57344) nuls.
> The 2 failed files have these at different points in the 2 files.  Each
> 56K nul block starts on an offset where x % 57344 == 0.
> 
> first file:
> >>> 519995392 / 57344.
> 9068.0 # matching 56K blocks before the one null 56K block
> 
> second file is certainly on a 1 K boundary, but I mislaid the diff data
> for it and it's taking forever for cmp to run to find the offset and
> verify that it's on a 56K boundary.  I'll follow up to this email
> tomorrow with the result of the cmp.
> 
> So, I searched the kernel source, expecting to find 56K in the sata
> driver code.  Instead the only place I found it that seemed relevant
> was:
> 
> 	./fs/cifs/README:  wsize default write size (default 57344)
> 
> I have since used cp to copy the file 4 times with tcpdump running at
> both ends.  All 4 times have worked properly.  Don't know if that is
> because tcpdump is slowing it down or if our test app could be at fault.
> Our test app is talking to the local file system and not with a block
> size of 56K, so I don't think it is our app.
> 
> Unfortunately, the tcpdumps at both ends are reporting the kernel
> dropping about 50% of the packets, so even if I can get it to fail,
> I'm still unsure whether it's the client or the samba server, where
> client would still leave me choosing between our app and fs/cifs.
> 
> The only other thing I can think of is the ethernet devices, but since
> the packet is made up of 30+ ethernet frames, and being TCP there is
> a payload checksum, I can't see the network layers being the culprit,
> but just in case:
> 
> client w/ fs/cifs:
> 04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
> 
> samba server:
> 01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
> 03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller
> 
> A few questions:
> 
> 0. Anyone already know of a bug in fs/cifs or samba that has this
> symptom?
> 
> 1. Anyone know how to get the kernel to not drop the packets?
> 
> 2. Any other ideas on what I can do to gather more data to differentiate
> between bad-app, fs/cifs, samba, or other-element-in-the-chain?
> 
> Thank you for all the work you guys do!
> 

Did the close() or fsync() call return an error?

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


* Re: Data corruption problem
       [not found]     ` <20110211065318.62f91a5b-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-02-11 14:35       ` Wayne Walker
       [not found]         ` <20110211143520.GI27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Wayne Walker @ 2011-02-11 14:35 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 11, 2011 at 06:53:18AM -0500, Jeff Layton wrote:
> On Thu, 10 Feb 2011 23:14:59 -0600
> Wayne Walker <wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org> wrote:
> 
> > First, I'm not certain whether this is samba, the linux cifs driver, or
> > something else.
> > 
> > He got me involved.  I found no useful messages (cifs, IO, kernel mem,
> > network, or samba) in any logs on client or server anywhere near the
> > times of the file creations.
> 
> Did the close() or fsync() call return an error?

No, Jeff.  Nothing in the logs and from the user space side there are
no errors.  All debug levels are at the system defaults.  After 12
attempts last night I've still not reproduced.  I will work with my QA
guy this morning (CST here) to see if we can reproduce.  What can I do
to gather the best data for you guys?

-- 

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


* Re: Data corruption problem
       [not found]         ` <20110211143520.GI27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
@ 2011-02-11 14:41           ` Jeff Layton
       [not found]             ` <20110211094117.1f012cae-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2011-02-11 14:41 UTC (permalink / raw)
  To: Wayne Walker; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, 11 Feb 2011 08:35:20 -0600
Wayne Walker <wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org> wrote:

> On Fri, Feb 11, 2011 at 06:53:18AM -0500, Jeff Layton wrote:
> > On Thu, 10 Feb 2011 23:14:59 -0600
> > Wayne Walker <wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org> wrote:
> > 
> > > First, I'm not certain whether this is samba, the linux cifs driver, or
> > > something else.
> > > 
> > > He got me involved.  I found no useful messages (cifs, IO, kernel mem,
> > > network, or samba) in any logs on client or server anywhere near the
> > > times of the file creations.
> > 
> > Did the close() or fsync() call return an error?
> 
> No, Jeff.  Nothing in the logs and from the user space side there are
> no errors.  All debug levels are at the system defaults.  After 12
> attempts last night I've still not reproduced.  I will work with my QA
> guy this morning (CST here) to see if we can reproduce.  What can I do
> to gather the best data for you guys?
> 

To be clear...are you sure that the close(2) or fsync(2) syscalls did
not return an error? It's a common bug for programs to ignore the return
code from close(2), and that's where errors during writeback get
reported.

The reason I ask is that there were some issues that were fixed
recently in mainline with cifs writeback. The CIFS code treated
timeouts during writeback as hard errors instead of retrying them, but
in those cases the client should have returned an error during fsync or
close.
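
As an illustration of the point about fsync and close, here is a minimal
Java sketch of a write path that lets those errors reach the caller. It
assumes the writer is a plain FileOutputStream; the class name DurableWrite
and the method writeAndSync are hypothetical. FileDescriptor.sync() is the
standard-library call that asks the OS to flush the file to the underlying
device, and whether a given kernel reports a cifs writeback failure through
it is exactly the open question in this thread.

```java
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch; not the code of the test app discussed in the thread.
final class DurableWrite {
    // Writes the data, forces it out to the device, then closes the stream.
    // An IOException from write(), sync() or close() propagates to the caller
    // instead of being swallowed.
    static void writeAndSync(String path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write(data);
            // Ask the OS to flush buffered data for this descriptor.
            // Throws SyncFailedException (an IOException) on failure.
            out.getFD().sync();
        } // try-with-resources calls close(); a failure there is thrown too
    }
}
```

A caller that wants to keep going after a failure can catch the IOException,
but it should at least log it and mark the file as suspect.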

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


* Re: UNS: Re: Data corruption problem
       [not found]             ` <20110211094117.1f012cae-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-02-11 15:00               ` Wayne Walker
  0 siblings, 0 replies; 9+ messages in thread
From: Wayne Walker @ 2011-02-11 15:00 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 11, 2011 at 09:41:17AM -0500, Jeff Layton wrote:
> > No, Jeff.  Nothing in the logs and from the user space side there are
> > no errors.  All debug levels are at the system defaults.  After 12
> > attempts last night I've still not reproduced.  I will work with my QA
> > guy this morning (CST here) to see if we can reproduce.  What can I do
> > to gather the best data for you guys?
> > 
> 
> To be clear...are you sure that the close(2) or fsync(2) syscalls did
> not return an error? It's a common bug for programs to ignore the return
> code from close(2), and that's where errors during writeback get
> reported.

Gotcha, I will grab a Java dev and code review the test app.  My Java-fu
is weak, but I think Java will throw an exception on either failure; I
just have to make sure some dev didn't put in a wide-open catch.

> The reason I ask is that there were some issues that were fixed
> recently in mainline with cifs writeback. The CIFS code treated
> timeouts during writeback as hard errors instead of retrying them, but
> in those cases the client should have returned an error during fsync or
> close.

Good to know.  Thank you.

-- 

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


* Re: Data corruption problem
       [not found] ` <20110211051458.GD27051-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
  2011-02-11  5:21   ` Wayne Walker
  2011-02-11 11:53   ` Jeff Layton
@ 2011-02-18 18:30   ` Wayne Walker
       [not found]     ` <20110218183003.GF25484-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
  2 siblings, 1 reply; 9+ messages in thread
From: Wayne Walker @ 2011-02-18 18:30 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 10, 2011 at 11:14:59PM -0600, Wayne Walker wrote:
> First, I'm not certain whether this is samba, the linux cifs driver, or
> something else.
> 
> During testing, one of my QA guys was running an inhouse program that
> generates pseudo-random, but fully recreatable, data and writes it to
> a file, the file is named with a name that is essentially the seed to
> the pseudo- random stream, so, given a filename, it can read the file
> and verify that the data is correct.
... snip ...

So, my QA guy has reproduced the failure 93 times, and only from a Linux
box, so it definitely appears to be a cifs driver issue.

What can I do to gather useful info?  tcpdump on both client and server
drops too many packets to be useful.

    A couple weeks ago, when running my data generator, I ran into a data corruption problem when creating a ~8GB file using `dp'. Based on an analysis that Wayne performed, he concluded that this problem is likely a CIFS/Samba bug. Since then, I created a test environment that now writes data to a disk array from 3 clients (2 Windows & 1 Linux). Yesterday, I ran a job that writes 500GB of data spread across ~11,000 files. I used `dp' to read back each file and verify the data, and it found 93 corrupt files. 

    Here are the results: http://qatest-sp/ui/index_archive_node.php/results/data_generator_test_detail/89

    A couple of things to note:

    - All the corrupt files were created on the Linux host `acorn'. None were from the Windows boxes.
    - The sizes of the corrupt files range from 350K to ~1 GB.

    This time, I am able to see additional log messages that I did not see last time (perhaps since I did not reboot the machines).

    From the Samba server (CentOS 5.5 samba-3.0.33-3.29.el5_5.1, hostname: snape):

    [2011/02/17 18:20:41, 0] lib/util_sock.c:write_data(562)
      write_data: write failure in writing to client 192.168.20.155. Error Broken pipe
    [2011/02/17 18:20:41, 0] lib/util_sock.c:send_smb(761)
      Error writing 55 bytes to client. -1. (Broken pipe)
    [2011/02/17 18:20:41, 1] smbd/service.c:close_cnum(1274)
      192.168.20.155 (192.168.20.155) closed connection to service data2
    [2011/02/17 18:20:41, 1] smbd/service.c:close_cnum(1274)
      192.168.20.155 (192.168.20.155) closed connection to service data2
    [2011/02/17 18:20:41, 1] smbd/service.c:make_connection_snum(1077)
      192.168.20.155 (192.168.20.155) connect to service data2 initially as user root (uid=0, gid=0) (pid 5312)

    From a Linux client (hostname: acorn):
    Feb 17 16:54:30 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
    Feb 17 16:57:10 acorn kernel:  CIFS VFS: No response to cmd 47 mid 46382
    Feb 17 16:57:10 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
    Feb 17 16:57:16 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
    Feb 17 16:57:31 acorn kernel:  CIFS VFS: No response for cmd 50 mid 46388
    Feb 17 16:59:52 acorn kernel:  CIFS VFS: No response to cmd 47 mid 64873
    Feb 17 16:59:52 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
    Feb 17 16:59:53 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
 
-- 

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


* Re: Data corruption problem
       [not found]     ` <20110218183003.GF25484-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org>
@ 2011-02-18 20:45       ` Jeff Layton
       [not found]         ` <20110218154552.7cf091a8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2011-02-18 20:45 UTC (permalink / raw)
  To: Wayne Walker; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, 18 Feb 2011 12:30:04 -0600
Wayne Walker <wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org> wrote:

> On Thu, Feb 10, 2011 at 11:14:59PM -0600, Wayne Walker wrote:
> > First, I'm not certain whether this is samba, the linux cifs driver, or
> > something else.
> > 
> > During testing, one of my QA guys was running an inhouse program that
> > generates pseudo-random, but fully recreatable, data and writes it to
> > a file, the file is named with a name that is essentially the seed to
> > the pseudo- random stream, so, given a filename, it can read the file
> > and verify that the data is correct.
> ... snip ...
> 
> So, my QA guy has repeated the failure - 93 times, only from a linux box, so it appears to definitely be a cifs driver issue.
> 
> What can I do to gather useful info?  tcpdump on both client and server drop too many packets to be useful.
> 

I asked before, but I don't think you ever gave a conclusive answer...

Did the kernel report an error when you did a fsync() or close()? I
suspect that it did, but sadly a lot of programs don't bother to check
for that (usually because they're not really able to deal with it).

>     From a Linux client (hostname: acorn):
>     Feb 17 16:54:30 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>     Feb 17 16:57:10 acorn kernel:  CIFS VFS: No response to cmd 47 mid 46382
>     Feb 17 16:57:10 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>     Feb 17 16:57:16 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>     Feb 17 16:57:31 acorn kernel:  CIFS VFS: No response for cmd 50 mid 46388
>     Feb 17 16:59:52 acorn kernel:  CIFS VFS: No response to cmd 47 mid 64873
>     Feb 17 16:59:52 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>     Feb 17 16:59:53 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>  

Those mean that calls to the server were occasionally timing out.
That's not terribly unusual under heavy load. Until very recently when
that happened, the kernel would treat that like a hard error and would
disconnect the socket.

You may want to test something more recent (like 2.6.38-rc5) to see if
the problems go away with that. Since you mention you're using CentOS
you could also open a bug at bugzilla.redhat.com and I'll try to look
at it when I get time.

If you have a RH support contract you may also want to open a support
case with this problem which would allow me to give it more priority.

Cheers,
-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


* Re: UNS: Re: Data corruption problem
       [not found]         ` <20110218154552.7cf091a8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2011-02-18 21:49           ` Wayne Walker
  0 siblings, 0 replies; 9+ messages in thread
From: Wayne Walker @ 2011-02-18 21:49 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 18, 2011 at 03:45:52PM -0500, Jeff Layton wrote:
> I asked before, but I don't think you ever gave a conclusive answer...
>
> Did the kernel report an error when you did a fsync() or close()? I
> suspect that it did, but sadly a lot of programs don't bother to check
> for that (usually because they're not really able to deal with it).

The write is done in Java with FileOutputStream.write(), which returns
void, implying that any failure comes as a thrown exception.  That
exception is caught 2 lines down, the stack trace is dumped, and then we
break out of the loop, so we would immediately stop writing.  Since the
files continue past the bad data, no error occurred during write().

But, I just found this:

          try {
              datain.close();
              dataout.close();
          } catch (IOException e) {
              // Do nothing
          }

I'll fix the code and have the tests rerun over the weekend or Monday
and see if we get any exceptions from close().
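
One possible shape for that fix is sketched below. It is only an assumption
about how the test app could be changed, not the change that was made. The
helper closes every stream even when an earlier close() fails, then rethrows
the first failure so the test run can flag the file. The names CloseAll and
closeAll are hypothetical; datain and dataout are the streams from the
snippet above.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper; the actual fix applied to the test app is not shown.
final class CloseAll {
    // Closes each resource in order. A failing close() does not prevent the
    // remaining resources from being closed. The first IOException is
    // rethrown at the end, with any later ones attached as suppressed.
    static void closeAll(Closeable... resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                if (resource != null) {
                    resource.close();
                }
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

With this in place the empty catch block could become a call such as
CloseAll.closeAll(datain, dataout), with the IOException either propagated
or logged together with the file name.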

> >     From a Linux client (hostname: acorn):
> >     Feb 17 16:54:30 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >     Feb 17 16:57:10 acorn kernel:  CIFS VFS: No response to cmd 47 mid 46382
> >     Feb 17 16:57:10 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >     Feb 17 16:57:16 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >     Feb 17 16:57:31 acorn kernel:  CIFS VFS: No response for cmd 50 mid 46388
> >     Feb 17 16:59:52 acorn kernel:  CIFS VFS: No response to cmd 47 mid 64873
> >     Feb 17 16:59:52 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >     Feb 17 16:59:53 acorn kernel:  CIFS VFS: Write2 ret -11, wrote 0
>
> Those mean that calls to the server were occasionally timing out.
> That's not terribly unusual under heavy load. Until very recently when
> that happened, the kernel would treat that like a hard error and would
> disconnect the socket.
>
> You may want to test something more recent (like 2.6.38-rc5) to see if
> the problems go away with that. Since you mention you're using CentOS
> you could also open a bug at bugzilla.redhat.com and I'll try to look
> at it when I get time.
>
> If you have a RH support contract you may also want to open a support
> case with this problem which would allow me to give it more priority.

Thank you.  I'll be back :)

--

Wayne Walker
wwalker-7+hyfkrzchDWTcdHvfGLfFaTQe2KTcn/@public.gmane.org
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?


