TechSoup.org The place for nonprofits, charities, and libraries

RAID 1 Losing Hard Drives

  • Ok, so I've got a problem with my server that I'm kind of stumped on, and I definitely need a few second opinions.

    Background: I just started for the NPO and we're running basically a glorified gaming machine with SBS 2003. (We have a new server on order, but it'll be a few weeks.) I'm running on an Intel Dual-Core around 1.8 GHz, 2 GB RAM, and twin 150 GB hard drives in a RAID 1 configuration.

    So a couple weeks ago, right after I started, I come in one morning to find the server not wanting to boot. The RAID controller claims that one of the hard drives has failed, and gives me the option to continue booting, destroy the array, replace and rebuild, etc. I try continuing to boot off the good drive, but it tells me it can't boot from the drive. Further investigation finds that the BIOS setup doesn't recognize either hard drive?! So I take turns unplugging the hard drives, and manage to boot from one while the other isn't plugged in. We replace the bad drive, rebuild the array, and everything is good.

    So I come into work today, find the power had gone out last night, and the server once again will not boot. Furthermore, I find that the battery backup software isn't installed, so the computer did not shut down last night when the battery backup kicked in. It continued to run and then went down hard when the battery died.

    I'm doubtful that the power outage could fry another hard drive, but I don't know what caused the last one (I had just started, hadn't even gotten to the server yet). I guess it could explain both drives, but has anyone ever heard of it happening twice?

    Nonetheless, if I try to rebuild the array on the bad drive, it should work, right? I mean, of course, if it's hardware related it won't, but if a few sectors are just bad, won't it write over them? And if it doesn't work out, it shouldn't affect the data on the good drive, should it?

    Thank you
    Jake

  • I am far from an expert on this stuff, but I will try.

    You don't say if your RAID is hardware or software based. If it is software based, you are just plugging in two drives and setting up RAID in the server's OS. You can do that in both Microsoft's server OSes and OS X.

    If that is the case, it can be very difficult to set up the boot process to allow failover, because the software that allows the mirroring is itself not loaded until the OS is loaded on booting. (Also, I believe software-based RAID cannot simultaneously write to both drives, so if the power goes out, there can be an issue with the two drives having different data on them.)

    Hardware RAID uses a RAID hard drive controller. When I was doing it, the RAID controller was a separate PCI card that you installed in the server and then plugged the drives into. (And it took only SCSI hard drives, too.) Now I believe RAID controllers can be integrated into server motherboards (and you can use SATA or PATA drives). Since you say you are running a glorified gaming PC as a server, I imagine you don't have one of those.

    So that might explain why your server isn't booting with one drive knocked out, the way it should. As to why the drives are not functioning, I couldn't say.

    But first, you have to make sure you have functioning hardware. You can test both drives for bad sectors and such, using chkdsk (run with the /r switch so it scans the disk surface) or a utility like Spinrite or Diskeeper. A few bad sectors aren't a big deal for new drives, once they are discovered and marked. (Bad sectors cannot be reliably written to or read from. But the file system keeps a list of bad sectors and does not use them at all, and the drive's own firmware can also swap failing sectors for spares. That list is updated when you run chkdsk or one of those utilities. If a hard drive is going bad, it will start to generate more and more bad sectors, but a relatively new drive shouldn't do that. If you do find new bad sectors, you may have lost whatever was on them, so you might have to reinstall stuff or restore data from backup. And yes, bad sectors on one drive won't affect another drive.)

    Once you have checked or replaced the drives, try setting them up in the server as single drives. (You can function like that for a while. You do back up the server, right?) That way you can see if the motherboard has a problem.

    Then install the battery backup software and the UPS. Then yank the power cord and see what happens. The server should shut itself down nicely. When you hit the power switch, it should boot without a problem.

    If you can do all that, then you know the problem was a combination of no battery backup software (to shut the computer down nicely) and the inability of software-based RAID to function worth a rat's ass (so as to let the computer reboot after a hard power down.)

    So when you get your new server, I would check out its RAID capabilities carefully. You may end up purchasing a separate RAID card (Adaptec used to make the best ones, but that was a while ago.)

    Hope this helps!
  • First and foremost, thank you for the help. I did miss a lot when posting yesterday. I'm gonna just talk off the top of my head a little bit and bounce some ideas out.

    The motherboard in the server/gaming machine actually has an integrated RAID controller! (How weird is that for a gamer?) It has 2 IDE slots, and then 2 MORE RAID IDE slots. Right now, the hard drives are plugged into the RAID 1 slot on the motherboard, and the RAID controller is handling all duties.

    I guess I could pull the bad hard drive, plug it into one of the IDE slots, boot the computer and run chkdsk on the drive. I never really liked chkdsk because it doesn't attempt to save any data. If it finds a bad sector it just marks it bad and moves on. I guess it doesn't really matter though; I'm going to rebuild the array, so data on the drive will be written over anyway, right? I'm trying to convince the financial director to invest in Spinrite, but it's not working out so well, and I didn't find the software on TechSoup, so oh well.

    Thank you again for the help. Shoot me more ideas if they come to you.

    Jake
  • Glad I can help.

    You are right, chkdsk doesn't attempt to recover data from bad sectors, but as you say, it probably doesn't matter, since you are going to reinstall everything anyway. (IIRC, there is a switch for the format command to do the same thing as chkdsk.)

    Diskeeper comes with a 30 day trial period, so you might check that out. I don't recall if it does more than defrag the drive, though.


    So you have an integrated RAID controller? I guess RAID 0 would speed up disk access, and that might be good on a gamer PC, to reduce those pauses when it hits the swap file or loads a new game level.

    So then you need to figure out what is happening when one drive fails. So what I might do is take two old cheap drives, make sure they function properly, set them up in the server as RAID 1, boot up, then yank the power connection on one of them while it is doing something disk-intensive. The computer should not even slow down. (We ran RAID 1 with an Adaptec controller, and because my dimwit techie didn't check his logs, one drive was totally dead for a few weeks before we noticed.)

    Then you might shut down the server (the proper way), disconnect one drive, and see if it will still boot. Then switch drives. You are testing the failover capability, the scenario of arriving in the morning to find one drive damaged from the power outage.

    If that works, you know that something about a hard power-down rendered both drives unbootable, but everything else is probably okay. So then the thing is to prevent a hard power-down, which you will do with the battery backup software.

    Does that make sense? We are breaking down the incident with the power outage into pieces, to see where the problem is.

    Good luck!
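
The breakdown suggested in the last post (pull one drive while running, then try booting from each drive alone) can be written down as a small decision sketch. This is only an illustration of that reasoning, not a tool that talks to any RAID controller: the `FailoverTests` class, the `diagnose` function, and the wording of each result are hypothetical names and messages made up for this example, and the outcomes have to be filled in by hand after running the tests on real hardware.

```python
from dataclasses import dataclass


@dataclass
class FailoverTests:
    """Hand-recorded outcomes of the manual tests (True means it passed).

    All names here are hypothetical; nothing is read from the hardware.
    """
    survives_hot_drive_pull: bool   # server kept running when one drive lost power
    boots_from_drive_a_alone: bool  # clean shutdown, drive B disconnected, then boot
    boots_from_drive_b_alone: bool  # clean shutdown, drive A disconnected, then boot


def diagnose(tests: FailoverTests) -> str:
    """Walk the tests in the order suggested in the thread and name the first failure."""
    if not tests.survives_hot_drive_pull:
        return ("Mirror did not survive losing a drive while running: "
                "suspect the RAID controller or the array setup.")
    if not (tests.boots_from_drive_a_alone and tests.boots_from_drive_b_alone):
        return ("Server cannot boot from a single drive: "
                "failover at boot time is the weak spot.")
    # Every failover test passed, so the hard power-down is the remaining suspect.
    return ("Failover works: the hard power-down is the likely culprit, "
            "so install the battery backup software to prevent it.")


if __name__ == "__main__":
    # Example: the server keeps running on one drive but will not boot from drive B alone.
    print(diagnose(FailoverTests(True, True, False)))
```

The checks run in the same order as the physical tests, so the first one that fails points at the piece to look into next.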