Dell PowerEdge 1750 setup for offline DB slaves
Notes on configuring the disks in Dell PowerEdge 1750s used as offline database slaves and online Linux pool nodes. (Generally applicable to most uses of software RAID in Linux, but most of the details below assume a typical STAR PE1750 setup.)
-----
November 9, 2012 update: This page was originally written for systems using 3 disks, each with a swap partition (typically the fourth partition, with partition type 82). This has the drawback that if a disk being used for swap had a problem, it could crash the whole system - exactly what we are trying to avoid by using RAID. (Note, though, that this particular failure mode has never happened to us despite numerous disk problems - a simple combinatoric result of the facts that we rarely use swap on our nodes, that what little swap we do use is typically not spread across all disks, and that many disk failures are partial rather than complete.)
I think all such instances have been eliminated for some time now (as of March 2014), which slightly modifies the procedure for replacing a disk: there is one more RAID array to deal with, but no need for any swapoff/swapon actions.
-----
The 3 disks (each 146GB SCSI U320) are partitioned identically. For instance:
[root@coburn raidinfo]# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=       63, size=   256977, Id=fd, bootable
/dev/sda2 : start=   257040, size=263787300, Id=fd
/dev/sda3 : start=264044340, size= 20482875, Id=fd
/dev/sda4 : start=284527215, size=  2104515, Id=82
(Unfortunately, the partitioning set-up has a degree of randomness that I do not understand -- the ordering of the partitions does not always come out as above, so verify any given node's layout before trying anything!)
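A quick, non-destructive way to check what a particular node is actually doing (these commands only read state, so they are safe to run at any time):

sfdisk -l            # print the partition tables of all disks
cat /proc/mdstat     # see which partitions belong to which md array
swapon -s            # see which partitions are currently in use as swap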
The small 125 MB partitions (/dev/sd[abc]1 in this example) are in a RAID 1 configuration (3 identical disks) which mounts on '/boot'.
The large ~115GB partitions (/dev/sd[abc]2) are in a RAID 5 configuration that mounts on '/db01' with the noatime option set in '/etc/fstab'.
The 10GB partitions (/dev/sd[abc]3) are in a RAID 5 configuration (for a total of 20GB+redundancy) which mounts on '/'.
The 1GB partitions (/dev/sd[abc]4) are swap space - all three are usable in normal operations for a total of 3 GB of swap. (Ideally, we wouldn't use ANY swap...)
Finally, each drive has a GRUB installation on the MBR, so if any one drive is removed, the system should still be able to boot and run. (In a total of one test, this was 100% successful...)
As a test, I removed one of the drives from the RAID array configurations, and physically removed it from the system. I was then able to rebuild a replacement drive with no need to reboot at any point.
What follows are my notes for manipulating the drives, the RAID arrays, GRUB, swap, etc. that are especially useful when replacing a drive.
Let's assume we have reason to believe that the 'sda' drive is faulty and we want to replace it.
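Before touching anything, it's worth making sure a dump of the drive's partition table exists somewhere, since step 2 below expects one. (The /root/raidinfo location is just the convention used on these nodes; if sda is already unreadable, generate the dump from a surviving disk instead, as described in step 2.)

mkdir -p /root/raidinfo
sfdisk -d /dev/sda > /root/raidinfo/partitions.sda   # save the partition table for later restoration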
First, if the RAID subsystem has not already done it, we have to mark the individual device components as faulty (use 'cat /proc/mdstat' to see the RAID status). Then we can remove the device. Be VERY CAREFUL to remove the correct device! Below we mark each RAID element as failed, and remove all failed devices from each array.
mdadm /dev/md0 -f /dev/sda3
mdadm /dev/md0 -r failed   # note: could be 'mdadm /dev/md0 -r /dev/sda3' to explicitly remove just that particular device (failed or otherwise!)
mdadm /dev/md1 -f /dev/sda2
mdadm /dev/md1 -r failed
mdadm /dev/md2 -f /dev/sda1
mdadm /dev/md2 -r failed
Now we remove the swap from use, if it is still active (use 'swapon -s' to list the active swap devices):
swapoff /dev/sda4
Now remove the device from the active SCSI configuration (use 'cat /proc/scsi/scsi' to see the SCSI devices and get the appropriate values for the four numbers: adapter/channel/id/lun):
echo "scsi remove-single-device 0 0 0 0" > /proc/scsi/scsi
At this point, the faulty disk can be physically removed.
October 2012 update: At this point, you can insert a new disk and proceed with the steps below. But using this procedure on our PowerEdge 1750s has led to terrible disk performance when there is no reboot after removing the failed drive: 2-3 MB/s is the typical speed seen without a reboot, as opposed to 50+ MB/s if a reboot is performed prior to insertion. So the recommendation for the PE 1750s is to remove the old disk at this point, reboot and let the machine finish booting, then insert the replacement disk and continue with the following. Note, though, that if done this way, the device names (sdX) are going to be different after the reboot (unless the last drive is the one that got replaced, in which case nothing changes). As said several times on this page - always verify the /dev/sdX device names, SCSI IDs, GRUB device names, etc. before using any of these commands, or you WILL break your RAID arrays and have a very bad day... If you need the PE1750 to stay alive for some reason, proceed without the reboot, but try to reboot it at the next available opportunity to restore the drive performance.
Adding the new disk follows these basic steps:
1. Make the SCSI device active (note: this may not always be necessary - perhaps it can be skipped when using SATA devices?):
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi
Here is another place to be VERY CAREFUL! Depending on the exact hardware involved (SCSI and SATA may behave differently in this regard), the new device may have already been detected and added. Use dmesg and /var/log/messages (plus the actual contents of /proc/scsi/scsi before and after insertion) to determine whether the device has been automatically added; if not, try to add it with the numbers determined from dmesg or /var/log/messages as appropriate. (If you do have to add it by hand, it will most likely use the same numbers as the faulty drive that was removed (SCSI/SCA only?).) In any case, the device assignment (/dev/sdX) might not match the drive that was yanked (this may be the case for ALL SATA drive replacements, TBC); the new device may be /dev/sdd (or sde, or whatever the next available letter happens to be) rather than /dev/sda. /var/log/messages and the output of dmesg should make it clear what device label was used for the new device. Always verify that the disk device and other parameters are what you think they are before proceeding - dmesg and /var/log/messages are your friends...
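For example, a few read-only checks along those lines (the sdd name is purely illustrative, and the /sys path assumes a 2.6-era kernel with sysfs mounted):

dmesg | tail -n 30           # look for the 'Attached scsi disk sdX' style messages
cat /proc/scsi/scsi          # compare against the pre-insertion output
ls -l /sys/block/sdd/device  # the symlink target ends in the host:channel:id:lun for that disk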
2. Now we need to put the partition table on the new disk. Here I assume I have a saved copy of the partition table that was previously generated with 'sfdisk -d'. For our typical PE1750 configuration, the partition tables are all identical, so you can generate a new table from a remaining disk with "sfdisk -d /dev/sdc > /root/raidinfo/partitions.sdc" (assuming sdc is not the one being swapped out):
sfdisk /dev/sda </root/raidinfo/partitions.sda
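To confirm the table was written as intended, one can dump it again and compare against the saved copy (if the saved dump was generated from a different disk, the device names inside it will of course differ):

sfdisk -d /dev/sda | diff /root/raidinfo/partitions.sda -   # no output means the tables match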
3. Add the partitions to their respective arrays (again a STRONG WARNING to verify the device strings below are appropriate - matching the device and the partition table in use on the replacement disk):
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda3
mdadm /dev/md2 -a /dev/sda2
Now the arrays should start rebuilding themselves. You can verify the rebuild is started with 'cat /proc/mdstat'. See the note below step 5 for more info about the rebuild process.
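To keep an eye on the rebuild without retyping the command, something like this is handy (watch is part of procps and should be present on these nodes):

watch -n 30 cat /proc/mdstat   # refresh the RAID status every 30 seconds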
4. Enable the swap space. (The labelling is arbitrary, but should match what is found in the corresponding entry in /etc/fstab so that it is activated in subsequent boots):
mkswap -L SWAP-sda4 /dev/sda4
swapon /dev/sda4
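To double-check that the label matches the fstab entry and that the new swap is active (assuming fstab refers to the swap space by LABEL= as described above):

grep -i swap /etc/fstab   # the LABEL= entry should match the label given to mkswap
swapon -s                 # /dev/sda4 should now appear in the list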
5. Install grub on the new disk, so that it can be booted from in the future if needed. This will look something like this:
grub
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
(The lines above are my historical method, but at some point I discovered the "grub-install" command, which seems an improvement with less likelihood of mistakes. That should look like the following.)
grub-install /dev/sda
or
grub-install /dev/md0
I'd like a way to verify that the GRUB installation is valid without having to reboot (or at least verify that the GRUB installation is in place), but I don't know a reliable way to tell whether GRUB is installed on a disk, or whether it has a viable installation... There is a --recheck option for the grub-install command, but the man page from a GRUB 0.97 installation says "This option is unreliable and its use is strongly discouraged."
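One crude check that at least confirms GRUB's stage1 was written to the MBR (it says nothing about whether the rest of the installation is actually bootable): the stage1 boot sector contains the literal string "GRUB", so it should show up when the first sector of the disk is dumped:

dd if=/dev/sda bs=512 count=1 2>/dev/null | strings | grep GRUB   # prints 'GRUB' if stage1 is present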
---
NOTES about the RAID rebuild:
Apparently, only one degraded software RAID array will be rebuilt at a time. If multiple rebuilds are pending, they will proceed sequentially, rather than simultaneously.
In the system log (/var/log/messages), when reconstruction begins something like the following should appear:
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
/proc/mdstat will give the completion percentage and an estimate of the remaining time during a rebuild operation.
Prior to (or possibly even during) a RAID rebuild, the speed range may be adjustable. Here is an example displaying the default max and min rates (in KB/sec) and then adjusting the minimum:
cat /proc/sys/dev/raid/speed_limit_max
200000
cat /proc/sys/dev/raid/speed_limit_min
1000
echo 5000 >/proc/sys/dev/raid/speed_limit_min
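The same knobs are also reachable through sysctl, assuming the sysctl utility is available, which may be more convenient:

sysctl dev.raid.speed_limit_min            # show the current minimum
sysctl -w dev.raid.speed_limit_min=5000    # equivalent to the echo above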
Growing a RAID array:
A bit off topic, but for lack of a better place, here is a nice bit of info to have in mind. It is possible to increase a RAID array size by replacing existing disks with larger disks. I have successfully done this with RAID 1 on a running system, and expect the same process would work with RAID5 and other levels (obviously not RAID 0!). The outline to this process goes like this:
For each disk to be replaced with a larger one, follow a recipe similar to that above, replacing *one* of the disks in the array with a larger disk, partitioned with RAID partitions sized as desired for the final arrangement. *Let it finish rebuilding/resyncing the RAID array(s) after each individual disk replacement.* (If you have a multi-redundant array, e.g. RAID 6, 10, etc., and think it through carefully, you could perhaps replace two disks at once to speed up the process, at the cost of an increased risk of total data loss...)
With the RAID devices all configured and fully rebuilt, one can grow the individual RAID devices (/dev/md0, /dev/md1, etc) with
mdadm --grow -z max /dev/md<N>
It will automatically grow to the largest available size for the given constituents.
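As an illustration for a single array (the md1 name is hypothetical; repeat for each md device involved):

mdadm --grow -z max /dev/md1                  # grow md1 to the largest size its members allow
mdadm --detail /dev/md1 | grep 'Array Size'   # confirm the new size was picked up
cat /proc/mdstat                              # a resync of the newly added space may be in progress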
Then resize the filesystem(s) on the RAID device(s) (assuming ext3):
resize2fs /dev/md<N>
To adjust swap size on a RAID device in a similar manner, the swapoff, mkswap and swapon commands should do the trick (reading the man pages is left as an exercise for the reader :-)
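For completeness, here is a sketch of what that would look like (the md3 device name and label are purely illustrative - check 'swapon -s' and /etc/fstab first):

swapoff /dev/md3               # take the swap array out of service
mkswap -L SWAP-md3 /dev/md3    # recreate swap across the full, grown device
swapon /dev/md3                # put it back into service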