SATA errors?

linux-raid.vger.kernel.org archive mirror

* SATA errors?
@ 2008-10-01  9:25 Danilo Godec
  2008-10-01 10:08 ` Wolfgang Denk
  0 siblings, 1 reply; 10+ messages in thread
From: Danilo Godec @ 2008-10-01  9:25 UTC (permalink / raw)
  To: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 2303 bytes --]

Hi,

I've been searching the web, Google, mailing lists for a while now, but
can't really find the answer - so I'm hoping for a 'SATA guru' here...

On one of my servers, I use a 4-drive Linux RAID5 array. One of the drives
keeps reporting these errors:

> Oct  1 10:11:29 bigxen2 kernel: ata1.00: exception Emask 0x10 SAct 0x0
> SErr 0x10002 action 0x2 frozen
> Oct  1 10:11:29 bigxen2 kernel: ata1.00: (irq_stat 0x04400000, PHY RDY
> changed)
> Oct  1 10:11:29 bigxen2 kernel: ata1.00: cmd
> 25/00:08:5c:be:60/00:00:14:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Oct  1 10:11:29 bigxen2 kernel:          res
> 50/00:00:6b:6a:60/00:00:14:00:00/e0 Emask 0x10 (ATA bus error)
> Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7
> secs)
> Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus
> 123 SControl 300)
> Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
> Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete
> Oct  1 10:11:47 bigxen2 kernel: SCSI device sda: 976771055 512-byte
> hdwr sectors (500107 MB)
> Oct  1 10:11:47 bigxen2 kernel: sda: Write Protect is off
> Oct  1 10:11:47 bigxen2 kernel: sda: Mode Sense: 00 3a 00 00
> Oct  1 10:11:47 bigxen2 kernel: SCSI device sda: drive cache: write back

As far as I can tell these happen randomly - sometimes it's 8 hours
between two, sometimes it's a couple of minutes. There are about 5-20 of
those per day, however Linux raid never kicks the drive out of the
array. There are also no other signs of the drive not functioning properly
(such as filesystem corruption or similar).

Any ideas? Can anyone 'decode' the above errors?
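
One way to start decoding the 'cmd' and 'res' lines is to split the register dump into its fields. The sketch below assumes the dump follows the taskfile order libata uses for these messages: command (or status), then feature (or error), sector count and three LBA bytes, then the same five "high order" bytes, and finally the device register. The function name and dictionary keys are made up for illustration, and the opcode table lists only the ATA opcode that appears in this log.

```python
import re

# Register dump as printed after "cmd" or "res":
# 1 byte / 5 bytes / 5 bytes / 1 byte, all in hex.
TASKFILE_RE = re.compile(
    r"([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})",
    re.IGNORECASE,
)

# Only the ATA opcode seen in this thread; extend as needed.
ATA_OPCODES = {0x25: "READ DMA EXT"}


def decode_taskfile(dump):
    """Split a libata-style register dump into named fields.

    For a "res" line the first byte is the status register, so the
    "name" lookup is only meaningful for "cmd" lines.
    """
    match = TASKFILE_RE.search(dump)
    if match is None:
        raise ValueError("no register dump found in %r" % dump)
    first = int(match.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in match.group(2).split(":")
    )
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    device = int(match.group(4), 16)
    # 48-bit LBA: high-order bytes occupy bits 47..24, low-order bytes 23..0.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": first,
        "name": ATA_OPCODES.get(first, "unknown"),
        "feature": feature,
        "sectors": (hob_nsect << 8) | nsect,
        "lba": lba,
        "device": device,
    }


if __name__ == "__main__":
    print(decode_taskfile("25/00:08:5c:be:60/00:00:14:00:00/e0"))
```

Read this way, the 'cmd' line is an 8-sector transfer (8 x 512 = 4096 bytes, which agrees with "data 4096 in") starting at LBA 0x1460be5c.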

 Thanks, Danilo

-- 
Danilo Godec, system support

Visit our updated web page at http://www.agenda.si

OPEN SOURCE AND LINUX
SERVICES : BUSINESS SOLUTIONS : IT MANAGEMENT : IT INFRASTRUCTURE : TRAINING : SOFTWARE 


[-- Attachment #2: danilo_godec.vcf --]
[-- Type: text/x-vcard, Size: 316 bytes --]

begin:vcard
fn:Danilo Godec
n:Godec;Danilo
org:Agenda OpenSystems
adr:;;Ul. Pohorskega bataljona 49;Maribor;;2000;Slovenia
email;internet:danilo.godec@agenda.si
tel;work:+386 2 4216138
tel;fax:+386 2 4216141
tel;cell:+386 40 618802
x-mozilla-html:FALSE
url:http://www.agenda.si/
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: SATA errors?
  2008-10-01  9:25 SATA errors? Danilo Godec
@ 2008-10-01 10:08 ` Wolfgang Denk
  2008-10-01 10:41   ` Danilo Godec
  0 siblings, 1 reply; 10+ messages in thread
From: Wolfgang Denk @ 2008-10-01 10:08 UTC (permalink / raw)
  To: Danilo Godec; +Cc: Linux RAID Mailing List

Dear Danilo,

In message <48E34221.1000008@agenda.si> you wrote:
> 
> I've been searching the web, Google, mailing lists for a while now, but
> can't really find the answer - so I'm hoping for a 'SATA guru' here...

I'm not a guru, but I have had a similar experience.

> As far as I can tell these happen randomly - sometimes it's 8 hours
> between two, sometimes it's a couple of minutes. There are about 5-20 of
> those per day, however Linux raid never kicks the drive out of the
> array. There are also no other signs of drive not functioning properly
> (such as filesystem corruption or similar).
> 
> Any ideas? Can anyone 'decode' the above errors?

In my experience, problems like this are often caused by
broken/unreliable cables / connectors / backplanes.

As a first measure, try replugging the SATA cables.

If this doesn't help, try swapping around the disks and cables to see
if the problem is with the cable (sticks with the disk) or with the
backplane (sticks with a physical port).

Then replace the faulty components.
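
Whether the log points at the link layer can be checked by decoding the SErr and SStatus values it prints. The sketch below is an illustration only: the bit descriptions are transcribed from the Serial ATA SError and SStatus register layouts as best recalled and should be verified against the specification, and the function names are made up here.

```python
# SError bit positions with short descriptions (to be checked against
# the Serial ATA specification before relying on them).
SERROR_BITS = {
    0: "recovered data integrity error",
    1: "recovered communications error",
    8: "non-recovered transient data integrity error",
    9: "persistent communication or data integrity error",
    10: "protocol error",
    11: "internal error",
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "comm wake",
    19: "10b to 8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unknown FIS type",
    26: "device exchanged",
}


def decode_serror(value):
    """Return the descriptions of all bits set in an SError value."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


def decode_sstatus(value):
    """Split SStatus into its three 4-bit fields."""
    return {
        "det": value & 0xF,         # device detection state
        "spd": (value >> 4) & 0xF,  # negotiated speed generation
        "ipm": (value >> 8) & 0xF,  # interface power management state
    }


if __name__ == "__main__":
    print(decode_serror(0x10002))
    print(decode_sstatus(0x123))
```

For the values in the log, SErr 0x10002 has bits 1 and 16 set, which matches the "PHY RDY changed" note the kernel prints, and SStatus 123 gives a speed field of 2, matching the reported 3.0 Gbps link. That says the link dropped and came back; it does not by itself say why.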

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"There's always been Tower of Babel sort of  bickering  inside  Unix,
but  this  is the most extreme form ever. This means at least several
years of confusion." - Bill Gates, founder and chairman of Microsoft,
about the Open Systems Foundation


* Re: SATA errors?
  2008-10-01 10:08 ` Wolfgang Denk
@ 2008-10-01 10:41   ` Danilo Godec
  2008-10-01 11:48     ` David Lethe
  0 siblings, 1 reply; 10+ messages in thread
From: Danilo Godec @ 2008-10-01 10:41 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

Wolfgang Denk wrote:
> In my experience, problems like this are often caused by
> broken/unreliable cables / connectors / backplanes.
>
> As a first measure, try replugging the SATA cables.
>
> If this doesn't help, try swapping around the disks and cables to see
> if the problem is with the cable (sticks with the disk) or with the
> backplane (sticks with a physical port).
>   
That has been one of my ideas too and I have already checked and swapped
cables - but no success. I couldn't change the backplane as I don't have
a spare one.

But one thing crossed my mind just seconds after I hit the 'send'
button!

I check the drive temperature periodically (every minute) using
'smartctl -a', and I only query one drive: '/dev/sda'! So I re-checked
my log files and indeed, the SATA error ALWAYS happens within one
second of the 'smartctl' command (though not after every run).

So now I changed the scripts to query a different drive ('/dev/sdb'). In
a couple of hours I should know if that was it...
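
The kind of log cross-check described above can be sketched as follows. This assumes the two sets of timestamps (kernel errors and 'smartctl' runs) have already been extracted from the logs; the function name is made up, and the sample times in the demo are hypothetical apart from the 10:11:29 error taken from the log excerpt.

```python
from datetime import datetime, timedelta


def errors_after_runs(error_times, run_times, window=timedelta(seconds=1)):
    """Split errors into those that follow some run within `window`
    and those that do not."""
    matched, unmatched = [], []
    for err in error_times:
        if any(timedelta(0) <= err - run <= window for run in run_times):
            matched.append(err)
        else:
            unmatched.append(err)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical smartctl run times, one per minute.
    runs = [datetime(2008, 10, 1, 10, 11, 28),
            datetime(2008, 10, 1, 10, 12, 28)]
    # First error is from the log above; the second is invented.
    errors = [datetime(2008, 10, 1, 10, 11, 29),
              datetime(2008, 10, 1, 10, 12, 40)]
    matched, unmatched = errors_after_runs(errors, runs)
    print("follow a run:", matched)
    print("unexplained:", unmatched)
```

If the "unexplained" list stays empty over a day of logs while many runs have no matching error, that fits the observation that every error follows a query but not every query causes an error.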

  Thanks for the help, Danilo

PS: Oh, one more thing - the 'sda' drive is a WDC, while all the others
are Seagate.

> Then replace the faulty components.
>
> Best regards,
>
> Wolfgang Denk
>
>   


* RE: SATA errors?
  2008-10-01 10:41   ` Danilo Godec
@ 2008-10-01 11:48     ` David Lethe
  2008-10-01 12:17       ` David Greaves
  0 siblings, 1 reply; 10+ messages in thread
From: David Lethe @ 2008-10-01 11:48 UTC (permalink / raw)
  To: Danilo Godec, Wolfgang Denk; +Cc: Linux RAID Mailing List



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Danilo Godec
> Sent: Wednesday, October 01, 2008 5:41 AM
> To: Wolfgang Denk
> Cc: Linux RAID Mailing List
> Subject: Re: SATA errors?
> 
> Wolfgang Denk wrote:
> > In my experience, problems like this are often caused by
> > broken/unreliable cables / connectors / backplanes.
> >
> > As a first measure, try replugging the SATA cables.
> >
> > If this doesn't help, try swapping around the disks and cables to see
> > if the problem is with the cable (sticks with the disk) or with the
> > backplane (sticks with a physical port).
> >
> That has been one of my ideas too and I have already checked and
> swapped cables - but no success. I couldn't change the backplane as I
> don't have a spare one.
> 
> But there is one thing though that crossed my mind just seconds after
> I've hit the 'send' button!
> 
> I periodically check (every minute)  the drive temperature using
> 'smartctl -a' and I only query one drive - '/dev/sda'! So I re-checked
> my log files and indeed - the SATA error ALWAYS happens within one
> second of the 'smartctl' command (but not every time).
> 
> So now I changed the scripts to query a different drive ('/dev/sdb').
> In a couple of hours I should know if that was it...
> 
>   Thanks for the help, Danilo
> 
> PS: Oh, one more thing - the 'sda' drive is a WDC, while all the others
> are Seagate.
> 
> 
> > Then replace the faulty components.
> >
> > Best regards,
> >
> > Wolfgang Denk
> >
> >
> 
> 

There is no cause of concern. The 0x25 command translates to
READ_CAPACITY10.  (i.e., how many blocks does the disk hold).  This
command is emulated because the disk doesn't natively speak SCSI
commands, which is how your specific hardware/driver/controller
combination configures such things.

Ignore the error.  There is a long story behind encapsulating SATA
commands in SCSI frames, why it is reasonable for this to happen, and
how the controller resolves obtaining the number of usable blocks.

It is not a cable or controller as other people suggested.  It isn't
even an "error".  It is a case of an application issuing a command to
see if it is supported. The disk returned appropriate data indicating
the command isn't supported, and all is right with the world. There are
other ways to obtain the drive capacity, which are clearly successful,
as you are putting data on the disk now.

David @ santools.com
-----------------------------------------------------------




* Re: SATA errors?
  2008-10-01 11:48     ` David Lethe
@ 2008-10-01 12:17       ` David Greaves
  2008-10-01 13:11         ` David Lethe
  0 siblings, 1 reply; 10+ messages in thread
From: David Greaves @ 2008-10-01 12:17 UTC (permalink / raw)
  To: David Lethe; +Cc: Danilo Godec, Wolfgang Denk, Linux RAID Mailing List

David Lethe wrote:
> There is no cause of concern. The 0x25 command translates to
> READ_CAPACITY10.  (i.e., how many blocks does the disk hold).  This
> command is emulated because the disk doesn't natively speak SCSI
> commands, which is how your specific hardware/driver/controller
> combination configures such things.

and yet look at the timestamps...


> Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7
> secs)
> Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus
> 123 SControl 300)
> Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
> Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete

That looks to me like 15-17 seconds of unresponsive disk; certainly the time
around the resets are times when the driver isn't allowing disk access.
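
The length of that window can be computed from the quoted timestamps. The sketch below is a minimal illustration: the helper names are made up, and because syslog lines carry no year, 2008 is supplied as an assumption.

```python
import re
from datetime import datetime

# Syslog-style prefix such as "Oct  1 10:11:30".
STAMP_RE = re.compile(r"([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")

QUOTED = [
    "Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7 secs)",
    "Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port",
    "Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)",
    "Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port",
    "Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete",
]


def parse_stamp(line, year=2008):
    """Return the timestamp at the start of a syslog line, or None."""
    match = STAMP_RE.search(line)
    if match is None:
        return None
    stamp = " ".join(match.group(1).split())  # collapse the padded day
    return datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")


def recovery_seconds(lines):
    """Seconds between the earliest and latest timestamped line."""
    stamps = [s for s in map(parse_stamp, lines) if s is not None]
    if not stamps:
        raise ValueError("no timestamps found")
    return (max(stamps) - min(stamps)).total_seconds()


if __name__ == "__main__":
    print(recovery_seconds(QUOTED))
```

For the lines quoted here this gives 17 seconds between "waiting for device to spin up" and "EH complete"; counting from the 10:11:29 exception line in the original report adds one more second.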

I'd say there was cause for something; although I'd cc the linux-ide group for
real insight, not linux-raid :)

David - maybe the response from the 0x25 command should not result in a reset -
or maybe the 0x25 should not be issued if it causes a state that does require a
reset.

I get similar softreset/hardreset problems with some samsung drives on some
controllers. I've not got round to investigating it yet. Sorry.

David


-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* RE: SATA errors?
  2008-10-01 12:17       ` David Greaves
@ 2008-10-01 13:11         ` David Lethe
  2008-10-01 21:36           ` Danilo Godec
  0 siblings, 1 reply; 10+ messages in thread
From: David Lethe @ 2008-10-01 13:11 UTC (permalink / raw)
  To: David Greaves; +Cc: Danilo Godec, Wolfgang Denk, Linux RAID Mailing List

> -----Original Message-----
> From: David Greaves [mailto:david@dgreaves.com]
> Sent: Wednesday, October 01, 2008 7:18 AM
> To: David Lethe
> Cc: Danilo Godec; Wolfgang Denk; Linux RAID Mailing List
> Subject: Re: SATA errors?
> 
> David Lethe wrote:
> > There is no cause of concern. The 0x25 command translates to
> > READ_CAPACITY10.  (i.e., how many blocks does the disk hold).  This
> > command is emulated because the disk doesn't natively speak SCSI
> > commands, which is how your specific hardware/driver/controller
> > combination configures such things.
> 
> and yet look at the timestamps...
> 
> 
> > Oct  1 10:11:30 bigxen2 kernel: ata1: waiting for device to spin up (7
> > secs)
> > Oct  1 10:11:40 bigxen2 kernel: ata1: soft resetting port
> > Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed (1st FIS failed)
> > Oct  1 10:11:41 bigxen2 kernel: ata1: softreset failed, retrying in 5 secs
> > Oct  1 10:11:46 bigxen2 kernel: ata1: hard resetting port
> > Oct  1 10:11:47 bigxen2 kernel: ata1: SATA link up 3.0 Gbps (SStatus
> > 123 SControl 300)
> > Oct  1 10:11:47 bigxen2 kernel: ata1.00: configured for UDMA/133
> > Oct  1 10:11:47 bigxen2 kernel: ata1: EH complete
> 
> That looks to me like 15-17 seconds of unresponsive disk; certainly the time
> around the resets are times when the driver isn't allowing disk access.
>
> I'd say there was cause for something; although I'd cc the linux-ide group for
> real insight, not linux-raid :)
>
> David - maybe the response from the 0x25 command should not result in a reset -
> or maybe the 0x25 should not be issued if it causes a state that does require a
> reset.
>
> I get similar softreset/hardreset problems with some samsung drives on some
> controllers. I've not got round to investigating it yet. Sorry.
> 
> David
> 
> 
> --
> "Don't worry, you'll be fine; I saw it work in a cartoon once..."
===========================================
Without spending a lot of time on this, my gut feeling is that the problem is
due to weakly implemented drive capacity query logic.  Something
wants to know the capacity of the drive, and when it doesn't get the expected
results, it issues a brute-force reset, probably because it assumes the
drive is locked up or something like that.  As the disks emulate SCSI
devices (SCSI commands are being sent to them), you
have to look at whatever does the translation.  If you just want to try a
brute-force can-I-fix-it approach, then look at the firmware for your RAID
controller first, then the drivers.  Do not just upgrade them without
first checking for potential compatibility problems and consulting the
appropriate vendor's support site.

Since the problem isn't limited to POST, there is a chance it
has nothing to do with your embedded controller/firmware.  You
could have an application program causing this.  Think about programs
you run that need to know how many blocks there are on a physical disk
drive, and see if you can disable them.  Certainly smartctl is one of
them.  All the fdisk family commands, mdadm, and RAID management
commands would need to know physical block counts at some point in time.

If this was my system, I would ...
1) First check into upgrading firmware/bios/drivers of disk controller.
2) Look at cron jobs and see if anything that needs capacity runs around
the time the errors are reported.  Something has to run to start this
off, so you need to find it.  
3) Use logger and a shell script to try to catch system in the 15 second
window when you have this problem, and see what programs are running. 
4) Actually, if this was my system, and if I/O wasn't actually being
suspended during those 15 seconds, then I probably would do step 1 only,
and if everything is current, then I would move on and not worry about
it.   Even if you find the offending program, then that doesn't mean
that the author of the program has or will make an acceptable change in
their code.  
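
Step 3 could be sketched roughly as below. Note that this uses Python rather than the logger and shell script suggested above, the trigger string is simply the "exception Emask" text from the reported log, and all function names are made up. It is a starting point under those assumptions, not a tested monitoring tool.

```python
import subprocess
import sys

# Text that marks the start of an error-handling episode in the
# kernel log lines reported in this thread.
TRIGGER = "exception Emask"


def is_trigger(line):
    """True for kernel log lines that open an ATA exception report."""
    return "kernel:" in line and TRIGGER in line


def snapshot_processes():
    """Return the current process list (pid, elapsed time, command)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def watch(lines, snapshot=snapshot_processes):
    """Yield (log line, process snapshot) for every trigger line."""
    for line in lines:
        if is_trigger(line):
            yield line.rstrip("\n"), snapshot()


def main():
    # Intended to be fed from a followed log, one line at a time.
    for line, processes in watch(sys.stdin):
        print("--- trigger:", line)
        print(processes)
```

Calling main() from a script and piping a followed system log into it (for example from "tail -F" on whichever file receives kernel messages on the system in question) prints the process list each time an exception line arrives. Short-lived commands may already have exited by then, so a process with a very small elapsed time is the thing to look for.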

The problem you have from SCSI perspective is the bozo who wrote this
chunk of code did it the wrong way. The CORRECT way to determine
addressable blocks is to send out the READCAP10, look for return value
of FFFFFFFFh blocks, then issue a 16-byte (0x9e 0x10) READ CAPACITY,
because you have > FFFFFFFE blocks on the disk.  This architect never
imagined that the READCAP10 would have to deal with large disks, and
assumed if there was a problem, then the disk needs to be reset.   
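
The two-step capacity query described above can be sketched as follows. This is an illustration under stated assumptions: send_cdb is a hypothetical transport callback (a CDB and an allocation length in, the returned data out), not a real library call, and the response layouts used are the standard ones (big-endian last LBA followed by block length).

```python
import struct

READ_CAPACITY_10 = 0x25
SERVICE_ACTION_IN_16 = 0x9E
SA_READ_CAPACITY_16 = 0x10


def read_capacity(send_cdb):
    """Return (block_count, block_length) using READ CAPACITY(10),
    falling back to READ CAPACITY(16) when the 32-bit field saturates.

    send_cdb(cdb, alloc_len) is a hypothetical callable that issues the
    CDB to the device and returns the response bytes.
    """
    cdb10 = bytes([READ_CAPACITY_10]) + bytes(9)
    data = send_cdb(cdb10, 8)
    last_lba, block_len = struct.unpack(">II", data[:8])
    if last_lba != 0xFFFFFFFF:
        return last_lba + 1, block_len

    # 16-byte CDB: opcode, service action, 8-byte LBA, 4-byte allocation
    # length, one reserved byte, control byte.
    cdb16 = struct.pack(">BBQIBB", SERVICE_ACTION_IN_16,
                        SA_READ_CAPACITY_16, 0, 32, 0, 0)
    data = send_cdb(cdb16, 32)
    last_lba, block_len = struct.unpack(">QI", data[:12])
    return last_lba + 1, block_len
```

For the disk in this thread the kernel reports 976771055 sectors of 512 bytes, which is well below the 32-bit limit, so under this logic the 10-byte command alone would suffice and the 16-byte fallback would never be issued.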

David @ SANtools.com
Storage Diagnostics Software
http://www.santools.com/smart/unix/manual


* Re: SATA errors?
  2008-10-01 13:11         ` David Lethe
@ 2008-10-01 21:36           ` Danilo Godec
  2008-10-01 22:03             ` Justin Piszcz
  0 siblings, 1 reply; 10+ messages in thread
From: Danilo Godec @ 2008-10-01 21:36 UTC (permalink / raw)
  To: David Lethe; +Cc: Linux RAID Mailing List

I don't want to start any holy wars, but I'm not using a RAID 
controller. It's just a plain old on-board SATA controller (at least 
that's what I think it is):

00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage 
Controller AHCI (rev 09)

David Lethe wrote:
> If this was my system, I would ...
> 1) First check into upgrading firmware/bios/drivers of disk controller.
> 2) Look at cron jobs and see if anything that needs capacity runs around
> the time the errors are reported.  Something has to run to start this
> off, so you need to find it.  
> 3) Use logger and a shell script to try to catch system in the 15 second
> window when you have this problem, and see what programs are running. 
> 4) Actually, if this was my system, and if I/O wasn't actually being
> suspended during those 15 seconds, then I probably would do step 1 only,
> and if everything is current, then I would move on and not worry about
> it.   Even if you find the offending program, then that doesn't mean
> that the author of the program has or will make an acceptable change in
> their code.  
>   
1. I will get a new server in a couple of days and then I'll be able to 
move the Xen VM's from the 'problematic' server. Then I'll see what can 
be updated/upgraded.
2. The errors are pretty much random and there is nothing in the cron at 
all. I don't think Xen VM's could do anything with the physical drive, 
so their crons shouldn't be relevant.
3. It's not really a problem that we (the users) would feel. It's just 
the logs that got me worried (I don't like unexplainable hard drive errors).
4. As said before, I changed the scripts to use 'smartctl' with one of 
the other drives. So far it seems better - there hasn't been a single 
error in 12 hours.
> The problem you have from SCSI perspective is the bozo who wrote this
> chunk of code did it the wrong way. The CORRECT way to determine
> addressable blocks is to send out the READCAP10, look for return value
> of FFFFFFFFh blocks, then issue a 16-byte (0x9e 0x10) READ CAPACITY,
> because you have > FFFFFFFE blocks on the disk.  This architect never
> imagined that the READCAP10 would have to deal with large disks, and
> assumed if there was a problem, then the disk needs to be reset.  
If it turns out that 'smartctl' was causing this I'll report it to 
'smartmontools' guys.

Thanks for the help, Danilo



* Re: SATA errors?
  2008-10-01 21:36           ` Danilo Godec
@ 2008-10-01 22:03             ` Justin Piszcz
  2008-10-03  9:52               ` Danilo Godec
  0 siblings, 1 reply; 10+ messages in thread
From: Justin Piszcz @ 2008-10-01 22:03 UTC (permalink / raw)
  To: Danilo Godec; +Cc: David Lethe, Linux RAID Mailing List



On Wed, 1 Oct 2008, Danilo Godec wrote:

> I don't want to start any holy wars, but I'm not using a RAID controller. 
> It's just a plain old on-board SATA controller (at least that's what I think 
> it is):
>
> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage 
> Controller AHCI (rev 09)
>
> David Lethe wrote:
>> If this was my system, I would ...
>> 1) First check into upgrading firmware/bios/drivers of disk controller.
>> 2) Look at cron jobs and see if anything that needs capacity runs around
>> the time the errors are reported.  Something has to run to start this
>> off, so you need to find it.
>> 3) Use logger and a shell script to try to catch system in the 15 second
>> window when you have this problem, and see what programs are running.
>> 4) Actually, if this was my system, and if I/O wasn't actually being
>> suspended during those 15 seconds, then I probably would do step 1 only,
>> and if everything is current, then I would move on and not worry about
>> it.   Even if you find the offending program, then that doesn't mean
>> that the author of the program has or will make an acceptable change in
>> their code.
> 1. I will get a new server in a couple of days and then I'll be able to move 
> the Xen VM's from the 'problematic' server. Then I'll see what can be 
> updated/upgraded.
> 2. The errors are pretty much random and there is nothing in the cron at all. 
> I don't think Xen VM's could do anything with the physical drive, so their 
> crons shouldn't be relevant.
> 3. It's not really a problem that we (the users) would feel. It's just the 
> logs that got me worried (I don't like unexplainable hard drive errors).
> 4. As said before, I changed the scripts to use 'smartctl' with one of the 
> other drives. So far it seems better - there hasn't been a single error in 12 
> hours.
>> The problem you have from SCSI perspective is the bozo who wrote this
>> chunk of code did it the wrong way. The CORRECT way to determine
>> addressable blocks is to send out the READCAP10, look for return value
>> of FFFFFFFFh blocks, then issue a 16-byte (0x9e 0x10) READ CAPACITY,
>> because you have > FFFFFFFE blocks on the disk.  This architect never
>> imagined that the READCAP10 would have to deal with large disks, and
>> assumed if there was a problem, then the disk needs to be reset. 
> If it turns out that 'smartctl' was causing this I'll report it to 
> 'smartmontools' guys.
>
> Thanks for the help, Danilo
>

I actually turned off smart (the daemon etc.) - the problems still persist
for me.

Justin.


* Re: SATA errors?
  2008-10-01 22:03             ` Justin Piszcz
@ 2008-10-03  9:52               ` Danilo Godec
  2008-10-03 16:13                 ` Iustin Pop
  0 siblings, 1 reply; 10+ messages in thread
From: Danilo Godec @ 2008-10-03  9:52 UTC (permalink / raw)
  Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 1736 bytes --]

Well, it's been two full days (48+ hours) and no error... I guess
smartctl doesn't like WDC drives (or better yet, vice versa).

  Danilo


Justin Piszcz wrote:
>
>
> On Wed, 1 Oct 2008, Danilo Godec wrote:
>
>> 1. I will get a new server in a couple of days and then I'll be able
>> to move the Xen VM's from the 'problematic' server. Then I'll see
>> what can be updated/upgraded.
>> 2. The errors are pretty much random and there is nothing in the cron
>> at all. I don't think Xen VM's could do anything with the physical
>> drive, so their crons shouldn't be relevant.
>> 3. It's not really a problem that we (the users) would feel. It's
>> just the logs that got me worried (I don't like unexplainable hard
>> drive errors).
>> 4. As said before, I changed the scripts to use 'smartctl' with one
>> of the other drives. So far it seems better - there hasn't been a
>> single error in 12 hours.
>>>
>> If it turns out that 'smartctl' was causing this I'll report it to
>> 'smartmontools' guys.
>>
>> Thanks for the help, Danilo
>>
>
> I actually turned off smart(daemon) /etc - the problems still persist
> for me..
>
> Justin.



* Re: SATA errors?
  2008-10-03  9:52               ` Danilo Godec
@ 2008-10-03 16:13                 ` Iustin Pop
  0 siblings, 0 replies; 10+ messages in thread
From: Iustin Pop @ 2008-10-03 16:13 UTC (permalink / raw)
  To: Danilo Godec; +Cc: Linux RAID Mailing List

On Fri, Oct 03, 2008 at 11:52:02AM +0200, Danilo Godec wrote:
> Well, it's been two full days (48+ hours) and no error... I guess
> smartctl doesn't like WDC drives (or better yet, vice versa).

I don't know about your specifics, but I had multiple (~10) WDC
harddrives running with smartctl/smartd polling them periodically, and
with various controllers (VIA, Promise, 3ware, but not Intel). I didn't
have any issues like this.

Maybe it's a specific interaction between the drive and the controller?
Or the driver?

iustin
