Replacing a Bad Drive with ZFS
One of the drives in my home file server was making occasional nasty clicking noises, which in hard drives is usually a sign of impending death. The drive is only a couple of months old, so it must have just been a bad apple. Some quick testing showed that it was failing its SMART self-tests, so getting a warranty replacement from Seagate was painless. Luckily, swapping the drive was quick and easy and all of the data was safe, since it lives on a zpool consisting of four terabyte drives in a RAID-Z configuration (if you’re familiar with traditional RAID, think RAID-5). Of course the data is also backed up, because RAID is not a substitute for backups, but as expected ZFS “just worked” and the new drive took over for the old drive with no hassles.
If you want to relive the experience, here’s my session on the server after shutting down, replacing the failing drive, and starting back up. First, on the console, Solaris complained about something not being quite right as soon as the server booted:
SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Jan 14 19:12:39 PST 2009
PLATFORM: PowerEdge 1800, CSN: BSQMN91, HOSTNAME: athena
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: c6647451-fa5a-4f4b-99fd-de1e76bb059d
DESC: The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more
information.
AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
Yeah, it didn’t like booting with a totally different hard drive in place of a drive that was a member of a zpool. A quick check confirms that Solaris is, indeed, complaining about the drive I replaced:
mwilson athena:~ [1258]% zpool status tank
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver completed after 0h0m with 0 errors on Wed Jan 14 19:12:11 2009
config:
NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  raidz1    DEGRADED     0     0     0
    c0t2d0  ONLINE       0     0     0
    c0t3d0  UNAVAIL      0     0     0  cannot open
    c0t4d0  ONLINE       0     0     0
    c0t5d0  ONLINE       0     0     0
errors: No known data errors
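With only four disks it’s easy to spot the bad one by eye. On a bigger pool, something like the following sketch could pick the unhealthy devices out of that output. The function name unhealthy_devices is made up for illustration, and the pattern assumes Solaris-style cXtYdZ device names like the ones shown here.

```shell
#!/bin/sh
# Sketch: list leaf devices that are not ONLINE.
# Reads `zpool status` text on stdin and prints "<device> <state>" per match.
# Assumption: device names look like c0t3d0 (Solaris cXtYdZ naming).
unhealthy_devices() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+/ && $2 != "ONLINE" { print $1, $2 }'
}

# Usage on the server (not run here):
#   zpool status tank | unhealthy_devices
```

Fed the status output above, this would print only the c0t3d0 line along with its UNAVAIL state.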
No surprises there, so we’ll tell it to replace c0t3d0. When given only the old device name, the zpool replace command assumes the new disk sits at the same location as the old one (c0t3d0, that is).
mwilson athena:~ [1260]% pfexec zpool replace tank c0t3d0
That command executes immediately and returns me to the prompt. We can monitor the status while the pool is resilvering:
mwilson athena:~ [1262]% zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.40% done, 2h45m to go
config:
NAME              STATE     READ WRITE CKSUM
tank              DEGRADED     0     0     0
  raidz1          DEGRADED     0     0     0
    c0t2d0        ONLINE       0     0     0
    replacing     DEGRADED     0     0     0
      c0t3d0s0/o  FAULTED      0     0     0  corrupted data
      c0t3d0      ONLINE       0     0     0
    c0t4d0        ONLINE       0     0     0
    c0t5d0        ONLINE       0     0     0
errors: No known data errors
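Rather than re-running zpool status by hand, the waiting could be scripted. This is only a rough sketch: the function names and the 60-second default interval are arbitrary choices, and it matches the "scrub:" line exactly as it appears in the output above. Other ZFS releases may label that line differently, so adjust the pattern to whatever your system prints.

```shell
#!/bin/sh
# Sketch: poll `zpool status` until a resilver is no longer in progress.

# Print the text of the "scrub:" line from `zpool status` output on stdin.
scrub_line() {
    sed -n 's/^[[:space:]]*scrub:[[:space:]]*//p'
}

# Exit 0 if the status text on stdin reports a resilver still running.
resilver_running() {
    scrub_line | grep -q 'resilver in progress'
}

# Print progress every $2 seconds (default 60) until pool $1 is done.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_running; do
        zpool status "$pool" | scrub_line
        sleep "$interval"
    done
    # Show the final "resilver completed ..." line.
    zpool status "$pool" | scrub_line
}

# Usage on the server (not run here):
#   wait_for_resilver tank 300
```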
And some time later…
mwilson athena:~ [1273]% zpool status tank
pool: tank
state: ONLINE
scrub: resilver completed after 2h34m with 0 errors on Wed Jan 14 21:48:43 2009
config:
NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    c0t2d0  ONLINE       0     0     0
    c0t3d0  ONLINE       0     0     0
    c0t4d0  ONLINE       0     0     0
    c0t5d0  ONLINE       0     0     0
errors: No known data errors
Everything is happy again! I love it when things just work the way they’re supposed to.