Replacing a Bad Drive with ZFS

One of the drives in my home file server was making occasional nasty clicking noises, which always precedes death in hard drives. The drive is only a couple of months old, so it must have just been a bad apple. Anyway, some quick testing showed that it was failing its SMART self-tests, so it was quick and easy to get a warranty replacement from Seagate. Luckily replacing the drive was quick and easy and all of the data was safe, since my data is on a zpool consisting of four terabyte drives in a RAID-Z configuration (if you’re familiar with traditional RAID, think RAID-5). Of course the data is also backed up, because RAID is not a substitute for backups, but as expected ZFS “just worked” and the new drive took over for the old drive with no hassles.

If you want to relive the experience, here’s my session on the server after shutting down, replacing the failing drive, and starting back up. First, on the console, Solaris complained about something not being quite right as soon as the server booted:

SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Jan 14 19:12:39 PST 2009
PLATFORM: PowerEdge 1800, CSN: BSQMN91, HOSTNAME: athena
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: c6647451-fa5a-4f4b-99fd-de1e76bb059d
DESC: The number of I/O errors associated with a ZFS device exceeded
      acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more
      information.
AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
      will be made to activate a hot spare if available. 
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.

Yeah, it didn’t like booting with a totally different hard drive in place of a drive that was a member of a zpool. A quick check confirms that Solaris is, indeed, complaining about the drive I replaced:

mwilson athena:~ [1258]% zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Wed Jan 14 19:12:11 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  UNAVAIL      0     0     0  cannot open
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0

errors: No known data errors

No surprises there, we’ll tell it to replace c0t3d0. Without any additional arguments, the zpool replace command will attempt to replace the old device with the same new device (c0t3d0, that is).

mwilson athena:~ [1260]% pfexec zpool replace tank c0t3d0

That command executes immediately and returns me to the prompt. We can monitor the status while the pool is resilvering:

mwilson athena:~ [1262]% zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.40% done, 2h45m to go
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          raidz1          DEGRADED     0     0     0
            c0t2d0        ONLINE       0     0     0
            replacing     DEGRADED     0     0     0
              c0t3d0s0/o  FAULTED      0     0     0  corrupted data
              c0t3d0      ONLINE       0     0     0
            c0t4d0        ONLINE       0     0     0
            c0t5d0        ONLINE       0     0     0

errors: No known data errors

And some time later…

mwilson athena:~ [1273]% zpool status tank
  pool: tank
 state: ONLINE
 scrub: resilver completed after 2h34m with 0 errors on Wed Jan 14 21:48:43 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0

errors: No known data errors

Everything happy again! I love it when things just work how they’re supposed to.

  1. Angelo Corbo

    Thank you very much for the post, Matt. It worked flawlessly.

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>