teh bigbro blog(tm)
Bigbro's foray into the scary world of blogging

Mon, 01 Aug 2005

Fixing a broken SCSI drive in a software RAID5 array

Very, very briefly, and mostly so I remember how to do it the next time I need to:
Assume that the RAID5 volume consists of at least /dev/sdb1, /dev/sdc1 and some others. Each partition is a type 'fd' partition (Linux raid autodetect) and takes up the whole disk. /dev/sdc is the disk that has failed and /dev/sdb still works. (This is only important because we're going to copy the partition information from /dev/sdb.)
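/dev/sdc is SCSI device channel 0, id 2, lun 0 on SCSI adaptor (host) 0 (Host: scsi0 Channel: 00 Id: 02 Lun: 00 from /proc/scsi/scsi.) Written in host, channel, id, lun order, that is the "0 0 2 0" used in the commands further down.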
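If your disk isn't at the same address as mine, the four numbers have to be read off the matching line in /proc/scsi/scsi. A minimal sketch of doing that with sed, assuming the line looks exactly like the one quoted above (the function name scsi_addr is made up for this example, it isn't part of any tool):

```shell
# scsi_addr: hypothetical helper, not a standard command.
# Takes one "Host: scsiN Channel: CC Id: II Lun: LL" line as its argument
# and prints "host channel id lun" with leading zeros stripped.
scsi_addr() {
    printf '%s\n' "$1" | sed -n \
        's/^Host: scsi\([0-9][0-9]*\) Channel: 0*\([0-9][0-9]*\) Id: 0*\([0-9][0-9]*\) Lun: 0*\([0-9][0-9]*\).*/\1 \2 \3 \4/p'
}

# Example with the line from my machine:
scsi_addr "Host: scsi0 Channel: 00 Id: 02 Lun: 00"
```

For my disk this prints 0 0 2 0. If the line doesn't match the assumed layout the function prints nothing, which is a good hint to stop and look at /proc/scsi/scsi by hand.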
Finally, and this is very VERY important - this procedure worked for me, on my hardware. It may not work on yours, and it may instead remove all your data, the data from your next door neighbour's machine and make all the milk in your fridge turn sour (and possibly kill your household pets.) I take exactly 0 (zero) responsibility for any loss you endure, direct or consequential, from trying this procedure, modified or otherwise, on your system (or anyone else's, for that matter.) Make sure you have a tested, up-to-date backup of your entire system before trying this procedure. - thanks :-)
How to replace a failed SCSI disk that's part of a s/w RAID volume (RAID5):

  1. Remove the failed device from the RAID array (md has probably already marked it as failed if it detected the failure):
    mdadm --fail /dev/md0 /dev/sdc1
    mdadm --remove /dev/md0 /dev/sdc1
  2. Remove the failed SCSI disk:
    echo "scsi remove-single-device 0 0 2 0" > /proc/scsi/scsi
  3. Put a new disk in (carefully making sure that you remove the correct disk from the enclosure - otherwise you're liable to remove your entire RAID volume from existence - you have been warned!)
  4. Tell Linux to insert the new disk (rescan the SCSI bus)
    echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi
  5. Partition the new disk appropriately. fdisk may not create partitions the same as the disk that was removed, so if you can copy the partition table from another disk that's still in the array and working, that's probably the best bet (if not always the safest thing to be trying)
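    sfdisk -d /dev/sdb | sed -e 's/sdb/sdc/' | sfdisk /dev/sdc (recommended by elrond - you'll probably want to do it manually... not that I don't trust elrond or anything... ;-) Note that it's sfdisk -d (dump) rather than -l (list), since only the dump format can be fed back into sfdisk.)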
  6. Zero the RAID superblock, just in case it had one before:
    mdadm --zero-superblock /dev/sdc1
  7. Add the partition on the new device back into the RAID volume and rebuild for resilience.
    mdadm --add /dev/md0 /dev/sdc1
  8. Show the details of the rebuild process:
    mdadm --detail /dev/md0
    or
    cat /proc/mdstat
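If you'd rather not squint at /proc/mdstat every few minutes, the rebuild percentage can be pulled out with sed. This is only a sketch: it assumes the progress line contains something like "recovery = 12.6%" (the layout can differ between kernel versions), and rebuild_progress is a name I've made up for the example.

```shell
# rebuild_progress: hypothetical helper, not part of mdadm.
# Reads mdstat-style text on stdin and prints the recovery percentage,
# or nothing at all if no "recovery = N%" line is present.
rebuild_progress() {
    sed -n 's/.*recovery *= *\([0-9][0-9.]*\)%.*/\1/p'
}

# Typical use on a live system (assumes /proc/mdstat exists):
#   rebuild_progress < /proc/mdstat
```

No output means there is no rebuild in progress, or that the line on your kernel doesn't match the assumed pattern, in which case mdadm --detail /dev/md0 is the thing to trust.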

posted at: 16:47 | path: /technical


copyright © 2005-2008, Gareth Eason