Wednesday, January 26, 2011

AVS / ZFS Seamless (Copy of Original Sun Blog)

I can't let this die... it's a solution with such high potential. It seems that with the Oracle purchase of Sun, the Sun links are all dead, so I'm reposting the original page from the Wayback Machine for reference.


How 'suite' it is... - Jackie Gleason
The "Availability Suite"

Tuesday Jun 12, 2007
AVS and ZFS, seamless?

A question was recently posted to zfs-discuss@opensolaris.org on the subject of AVS replication versus ZFS send/receive for odd-sized volume pairs, asking whether the use of AVS makes it all seamless. Yes, the use of Availability Suite makes it all seamless, but only after AVS is initially configured.
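
For contrast, a minimal sketch of the snapshot-based send/receive approach the question mentioned might look like the following, where the pool, dataset, and snapshot names are purely hypothetical and NODE-B is the remote host:

# zfs snapshot tank/data@rep-1
# zfs send -i tank/data@rep-0 tank/data@rep-1 | ssh NODE-B zfs receive -F tank/data

That is periodic, point-in-time replication driven by snapshots and a script or cron job; Availability Suite instead replicates every write continuously beneath ZFS, which is what the rest of this entry sets up.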

Unlike ZFS, which was designed and developed to be very easy to configure, Availability Suite requires explicit and somewhat overly detailed configuration information to be set up, and set up correctly, for it to work seamlessly.

Recently I worked with one of Sun's customers on the configuration of two Sun Fire x4500 servers, a remarkably well-performing system: a four-way x64 server with the highest storage density available, 24 TB in 4U of rack space. The customer's desired configuration was simple: two servers in an active-active, high-availability configuration, deployed 2000 km apart, with each system acting as the disaster recovery system for the other. Replication needed to be CDP, Continuous Data Protection, running 24/7/365 in both directions; once set up correctly, CDP would work seamlessly and be a lights-out operation.

Each x4500, or Thumper, comes with 48 disks, two of which will be used as the SVM mirrored system disk (can't have a single point of failure), leaving 46 data disks. Since each system will also be the disaster recovery system for the other site, this leaves 23 disks available on each system as data disks, as laid out below. The decisions about what type of ZFS-provided redundancy to use, the number of volumes in each pool, and whether compression or encryption is enabled are not a concern to Availability Suite, since whatever vdevs are configured, the ZFS volume and file metadata will get replicated too.
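
To make the disk budget explicit (the counts come straight from the paragraph above):

48 disks per x4500
 - 2 disks for the SVM mirrored system disk
 = 46 data disks, split as
   23 disks for the local ZFS storage pool, plus
   23 disks acting as SNDR replica targets for the other site's pool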

For testing out this replicated ZFS on AVS scenario on my Thumper, here are the steps followed:

1). Take one of the 46 disks that will eventually be placed in the ZFS storage pool. Use the ZFS zpool utility to correctly format this disk, an action which will create an EFI-labeled disk with all available blocks in slice 0. Then delete the pool.

# zpool create -f temp c4t2d0; zpool destroy temp

2). Next run the AVS 'dsbitmap' utility to determine the size of an SNDR bitmap to replicate this disk's slice 0, saving the results for later use.

# dsbitmap -r /dev/rdsk/c4t2d0s0 | tee /tmp/vol_size
Remote Mirror bitmap sizing

Data volume (/dev/rdsk/c4t2d0s0) size: 285196221 blocks
Required bitmap volume size:
Sync replication: 1089 blocks
Async replication with memory queue: 1089 blocks
Async replication with disk queue: 9793 blocks
Async replication with disk queue and 32 bit refcount: 35905 blocks

The selection here will be either synchronous replication or asynchronous replication with a memory queue; both require the same size bitmap. Other replication types also work with ZFS, but synchronous replication is best if network latency is low.
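
As a rough sanity check on the sync figure, assuming the Remote Mirror bitmap tracks roughly one bit per 32 KB of data volume plus a small header (my reading of the SNDR documentation, so treat the granularity as an assumption):

285196221 blocks x 512 bytes = ~136 GB of data volume
~136 GB / 32 KB per bit      = ~4,456,191 bits
~4,456,191 bits / 8 / 512    = ~1089 blocks, which matches dsbitmap's sync estimate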

3). To assure redundancy of the SNDR bitmaps, each will be mirrored via SVM, hence we will need to double the number of blocks needed, rounded up to a multiple of 8 KB, or 16 blocks.

# VOL_SIZE="`cat /tmp/vol_size| grep 'size: [0-9]' | awk '{print $5}'`"
# BMP_SIZE="`cat /tmp/vol_size| grep 'Sync ' | awk '{print $3}'`"
# SVM_SIZE=$((((BMP_SIZE + 15) / 16) * 16 * 2))
# ZFS_SIZE=$((VOL_SIZE-SVM_SIZE))
# SVM_OFFS=$(((34+ZFS_SIZE)))
# echo "Original volume size: $VOL_SIZE, Bitmap size: $BMP_SIZE"
# echo "SVM soft partition size: $SVM_SIZE, ZFS vdev size: $ZFS_SIZE"
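
For reference, plugging in the dsbitmap figures from step 2 (VOL_SIZE=285196221, BMP_SIZE=1089), the two echo commands report the following, and SVM_OFFS works out to 285194047:

Original volume size: 285196221, Bitmap size: 1089
SVM soft partition size: 2208, ZFS vdev size: 285194013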

5). Use the 'find' utility below, adjusting its first parameter to produce the list of volumes that will be placed into the ZFS storage pool. Carefully examine this list, and adjust the first search parameter and/or use 'egrep -v "disk|disk"' for one or more disks, to exclude from this list any volumes that are not to be part of this ZFS storage pool configuration.

The resulting list produced by "find ..." is key in reformatting all of the LUNs that will be part of a replicated ZFS storage pool.

# find /dev/rdsk/c[45]*s0
or
# find /dev/rdsk/c[45]*s0 | egrep -v "c4t2d0s0|c4t3d0s0"

6). Re-use the corrected find command from above as the driver to change the format of all of those volumes.

# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 0:4:0:34:$ZFS_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 1:4:0:$SVM_OFFS:$SVM_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 prtvtoc |egrep "^ [01]|partition map"
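
With the example sizes from step 3, the prtvtoc check should show each disk carved up roughly as follows (illustrative; the sector counts will track your own disk size):

slice 0:  first sector 34,         285194013 sectors  (the future ZFS vdev)
slice 1:  first sector 285194047,       2208 sectors  (room for the SNDR bitmap mirror)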

7). Re-use the corrected find command from above, with the additional selection of only even numbered disks, placing slice 1 of all selected disks into the SVM metadevice d101

# find /dev/rdsk/c[45]*[24680]s1 | xargs -I {} echo 1 {} | xargs metainit d101 `find /dev/rdsk/c[45]*[24680]s1 | wc -l`
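
To see what that pipeline actually runs: with three even-numbered disks selected it would expand to a single concatenation of one-slice stripes, something like the line below (the device names are only an example). The leading count is the number of stripes, and each "1 <slice>" pair is a one-way stripe.

# metainit d101 3 1 /dev/rdsk/c4t4d0s1 1 /dev/rdsk/c4t6d0s1 1 /dev/rdsk/c4t8d0s1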

8). Re-use the corrected find command from above, with the additional selection of only odd numbered disks, placing slice 1 of all selected disks into the SVM metadevice d102

# find /dev/rdsk/c[45]*[13579]s1 | xargs -I {} echo 1 {} | xargs metainit d102 `find /dev/rdsk/c[45]*[13579]s1 | wc -l`

9). Now mirror metadevices d101 and d102 into mirror d100, ignoring the WARNING that both sides of the mirror will not be the same. When the bitmap volumes are created, they will be initialized, at which time both sides of the mirror will be equal.

# metainit d100 -m d101 d102
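
If that WARNING is bothersome, the more conventional SVM sequence below ends up in the same place; it just performs a resync of d102 from d101 that is unnecessary here, since the bitmap volumes get initialized anyway:

# metainit d100 -m d101
# metattach d100 d102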

10). Now, from the mirrored SVM metadevice, allocate bitmap volumes out of SVM soft partitions for each SNDR replica.

# OFFSET=1
# for n in `find /dev/rdsk/c[45]*s1 | grep -n s1 | cut -d ':' -f1 | xargs`
do
metainit d$n -p /dev/md/rdsk/d100 -o $OFFSET -b $BMP_SIZE
OFFSET=$(((OFFSET + BMP_SIZE + 1)))
done
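
Expanded with the example bitmap size of 1089 blocks, the first few iterations of that loop issue:

# metainit d1 -p /dev/md/rdsk/d100 -o 1 -b 1089
# metainit d2 -p /dev/md/rdsk/d100 -o 1091 -b 1089
# metainit d3 -p /dev/md/rdsk/d100 -o 2181 -b 1089

The extra block added to each offset presumably leaves room for the soft partition's on-disk watermark.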

11). Repeat steps 1 - 10 on the SNDR remote system (NODE-B)

12). Perform the SNDR enables on NODE-A

# DISK=1
# for ZFS_DISK in `find /dev/rdsk/c[45]*s0`
do
sndradm -nE NODE-A $ZFS_DISK /dev/md/rdsk/d$DISK NODE-B $ZFS_DISK /dev/md/rdsk/d$DISK ip sync g zfs-pool
DISK=$(((DISK + 1)))
done
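
Expanded, the first set enabled on NODE-A looks like this (the data disk shown is just an example); step 13 repeats exactly the same enables on NODE-B:

# sndradm -nE NODE-A /dev/rdsk/c4t4d0s0 /dev/md/rdsk/d1 NODE-B /dev/rdsk/c4t4d0s0 /dev/md/rdsk/d1 ip sync g zfs-pool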

13). Repeat step 12 on NODE-B

14). Create the ZFS storage pool

# find /dev/rdsk/c[45]*s0 | xargs zpool create zfs-pool

15). Enable SNDR replication, and take a look at what you have done!

# sndradm -g zfs-pool -nu
# sndradm -g zfs-pool -P
# metastat -P
# zpool status zfs-pool
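
Replication activity per set can also be watched with Availability Suite's dsstat utility, something like the following (the -m sndr mode restricts output to Remote Mirror sets, and the trailing number is a refresh interval in seconds):

# dsstat -m sndr 5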

Posted at 07:04PM Jun 12, 2007 by jilokrje in Sun | Comments[2] | Permalink
Comments:

I'm trying this on Solaris Express b77

Step 14 doesn't work. I get the error:-

cannot use '/dev/rdsk/c3d0s0': must be a block device or regular file

"sndradm -g zfs-pool -P" should be a lowercase "-p"

Despite this I cannot get it to work. I can set it up OK on both machines, but nothing is being replicated between nodes at all.

Posted by Nathan on December 26, 2007 at 11:28 PM EST #

Sorry, I should have said "metastat -P" should be "metastat -p"

I kickstarted everything off with a full volume copy "sndradm -m" and then turned autosync on with "sndradm -a on" on both nodes. Now there's a lot of activity.

Posted by Nathan on December 27, 2007 at 04:15 AM EST #