Storage hardware/software overview
The Darwin compute facility has three main parallel filesystems. All three use IBM GPFS filesystem technology. The capacity, backup policies and level of redundant failover hardware used in each system are different.
Hardware
The Darwin filesystems are built on a set of 768 750GB SATA hard drives. These drives are managed by eight IBM DS4700 drive array controllers, each of which manages a set of drives held in DS4000 EXP810 expansion units. The DS4700 controllers attach to eight IBM x3650 nodes, which export the drives using GPFS, configured as three separate filesystems: /home, /data and /scratch. All three filesystems use a RAID-5 scheme with automatic rebuild, which provides protection against, and automated recovery from, single drive failures.
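The drive counts above imply a raw capacity that can be worked out directly. The short Python sketch below does that arithmetic and then estimates how much remains after RAID-5 parity. The group size of five disks is taken from the drive-failure section of this page, and the 4 data + 1 parity layout is an assumption. Hot-standby disks, mirroring and filesystem overhead are ignored, so treat the result as a rough upper bound rather than a statement of the actual filesystem sizes.

```python
# Figures from the text: 768 SATA drives of 750GB each.
DRIVE_COUNT = 768
DRIVE_SIZE_GB = 750

# Assumption: each RAID-5 group (LUN) has five disks, one disk's worth of which is parity.
DISKS_PER_LUN = 5


def raw_capacity_tb(drive_count, drive_size_gb):
    """Raw capacity in decimal terabytes (1 TB = 1000 GB)."""
    return drive_count * drive_size_gb / 1000


def raid5_usable_tb(raw_tb, disks_per_lun=DISKS_PER_LUN):
    """Capacity left after reserving one disk per group for parity."""
    return raw_tb * (disks_per_lun - 1) / disks_per_lun


raw_tb = raw_capacity_tb(DRIVE_COUNT, DRIVE_SIZE_GB)
usable_tb = raid5_usable_tb(raw_tb)

print(f"raw capacity:            {raw_tb:.1f} TB")
print(f"after RAID-5 parity:     {usable_tb:.1f} TB (upper bound, no spares or mirroring)")
```

Mirrored filesystems consume twice their visible capacity from this pool, which is one reason the mirrored filesystems are kept comparatively small.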
The /home filesystem
/home is the only file system that is backed up to tape.
The /home filesystem holds account home directories, and is the filesystem an account is placed in by default at initial login. The /home filesystem has a total capacity of 8.1TB. Each user account has a 200GB quota of disk space by default. The /home storage will be backed up to tape once the tape subsystem is in place. The total size of /home and the individual account quotas are limited by the capacity of the tape drive. Writes to /home are replicated, providing additional protection (beyond RAID-5) against drive failures.
The /data filesystem
The /data filesystem is not backed up
The /data filesystem has a total capacity of 40TB. In addition to being RAID-5, data written to this filesystem is mirrored (a duplicate copy is written). This provides a fairly high degree of protection against hardware failures, although the contents of /data are not backed up to tape by the Darwin administrators. Individual accounts have a default quota of 2TB of storage on /data.
The /scratch filesystem
The /scratch filesystem is not backed up
Eventually the /scratch filesystem will have a capacity of between 400 and 500TB. At present, due to a firmware limitation on the IBM DS4700 controllers, the /scratch filesystem is limited to 119TB. Data is stored using a RAID-5 scheme, but is not mirrored on /scratch. Data stored on /scratch therefore has only one level of protection against hardware failures, and is potentially vulnerable to two drive failures occurring within a single group of five drives within a roughly twenty-four hour period. At this stage we have no evidence that this is an actual problem; however, we do not yet have long-term statistics on drive reliability.
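To get a feel for the size of this risk, the Python sketch below estimates the chance that a second drive in a five-drive group fails during a twenty-four hour window after the first failure. It is only an illustration: it assumes drives fail independently at a constant rate, and the annual failure rates it loops over are hypothetical placeholders, not measurements from Darwin. The function name and its parameters are likewise made up for this example.

```python
import math


def second_failure_probability(annual_failure_rate, surviving_drives=4, window_hours=24.0):
    """Probability that at least one surviving drive fails within the window.

    Assumes independent drives with a constant hazard rate (exponential
    lifetimes), derived from a hypothetical annual failure rate.
    """
    # Constant hazard rate per hour that reproduces the given annual failure rate.
    hazard_per_hour = -math.log(1.0 - annual_failure_rate) / (365 * 24)
    # Chance that none of the surviving drives fails, then take the complement.
    return 1.0 - math.exp(-surviving_drives * hazard_per_hour * window_hours)


# Hypothetical annual failure rates, for illustration only.
for afr in (0.01, 0.03, 0.10):
    p = second_failure_probability(afr)
    print(f"AFR {afr:.0%}: P(second failure in 24h) = {p:.2e}")
```

This is the risk for a single group after a single failure. The chance of seeing such an event somewhere on the filesystem over a year also depends on the number of groups and on how often first failures occur, neither of which is modelled here.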
Handling hard drive failures
The filesystems are built on top of groups of five disks called LUNs. Because the filesystems use a large number of disks, we expect occasional hard drive failures within these LUNs. RAID-5 systems store parity information on separate disks, allowing the stored data to be recreated automatically should a single disk fail. Under a RAID-5 configuration, when a single drive failure is detected in a LUN, the LUN continues to function as follows:
- the LUN uses parity information to continue to provide stored data, but operating in a recovery mode
- the system automatically begins to create a replacement drive using a drive from a pool of hot-standby disks
- once the hot-standby disk has been correctly initialized the LUN reverts to operating in a normal mode.
If these steps happen as described then everything will be fine. However, things can go awry if:
- a hardware problem is not detected
- there are no hot-standby disks available, perhaps because an earlier drive failure occurred
- there is a second drive failure on a LUN while the system is still rebuilding it into normal mode.
In any of these cases it is possible that some data will be lost. The /home and /data filesystems have an extra layer of redundancy, because all data is mirrored to two sets of LUNs. However, this doubles the number of drives needed to store a given amount of data. The /scratch filesystem does not mirror data writes, so it is slightly less robust, but has a much larger capacity.
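The parity-based recovery described in this section can be illustrated with a few lines of Python. This is a minimal sketch of the general XOR-parity idea behind RAID-5, not the DS4700 controllers' actual implementation. It models a single stripe on a hypothetical five-disk group with four data blocks and one parity block, whereas real RAID-5 distributes parity across the disks. The function names and block contents are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, failed_index):
    """Recreate the block at failed_index from the other blocks in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# One stripe: four data blocks plus one parity block (hypothetical contents).
data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data_blocks)
stripe = data_blocks + [parity]

# Simulate losing the third disk and rebuilding its block onto a hot standby.
recovered = rebuild(stripe, failed_index=2)
print(recovered == data_blocks[2])
```

Because each block is the XOR of the other four, any single missing block can be recomputed. If two blocks of the same stripe are lost, the remaining three no longer determine them, which is why a second failure during a rebuild can lose data unless a mirror copy exists.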
