Tuesday, January 20, 2009

OpenSolaris 2008.11 active/active disk replication

*This post is IN PROGRESS and will/may be edited*
*I have given up on this. I've gotten zero help from Sun, and others have posted that THIS DOES NOT WORK. Good luck with OpenSolaris and AVS/ZFS replication.*

In this post, I'll describe what I had to do to get seamless AVS/ZFS replication on my homegrown OpenSolaris box. This is based on this blog entry: http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless . That is ZFS (the Zettabyte File System) and AVS (Availability Suite), with SVM (Solaris Volume Manager) holding the bitmap volumes. That entry assumes you are intimately familiar with Solaris, and that you have the budget of a large corporation. I run a small email service based on FreeBSD and don't have any Solaris experience. The following is what I've learned.

Install. As I've already said, the installation is extremely easy. I've discovered that in this particular case, that's a bad thing. Solaris has a unique way of putting data on a disk. Typically, an OS will create a partition for itself on a drive and that's it. Solaris takes that a step further by creating 'slices' inside the Solaris partition. Historically, each slice was a UFS mount point: one slice would be /usr, another would be /var, and so on. You see the same idea on other OSes, done with regular primary and extended partitions. With the advent of ZFS, I think this is probably unnecessary, but there are legacy applications, such as Availability Suite, that still use these slices. This is where our problem begins... read on for the fix.
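If you just want to look at a disk's slice layout without walking through format's menus, prtvtoc prints the VTOC directly. It's read-only, so it's a safe first look (the device below is my boot mirror; adjust for yours):

# prtvtoc /dev/rdsk/c3t0d0s2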

My system is configured as follows:
2 80GB drives in hardware RAID 1 on an Areca controller
6 500GB drives JBOD using the MB controller

The feeling I get from #opensolaris is that I'm an idiot for doing hardware RAID at all. Personally, I just don't trust software RAID for an essential system's boot drives. No big deal, I think, until I get to the part where AVS replication needs to be configured. This needs to be said at the beginning: you MUST have a free slice for SVM. SVM uses what's called a metadb (a state database) to store information about the soft volumes you create. The OpenSolaris installer creates a single root slice and offers you no partitioning options, so you're screwed. Newbies: use format to READ the drive info. I know it's a bit disconcerting to type format at the command line without knowing what it will do, but it's pretty painless until you start changing things:
root@sysvoltwo:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c3t0d0
/pci@0,0/pci1022,9604@4/pci17d3,1200@0/sd@0,0
1. c4d0
/pci@0,0/pci-ide@11/ide@0/cmdk@0,0
2. c4d1
/pci@0,0/pci-ide@11/ide@0/cmdk@1,0
3. c5d0
/pci@0,0/pci-ide@11/ide@1/cmdk@0,0
4. c5d1
/pci@0,0/pci-ide@11/ide@1/cmdk@1,0
5. c7d0
/pci@0,0/pci-ide@14,1/ide@1/cmdk@0,0
6. c7d1
/pci@0,0/pci-ide@14,1/ide@1/cmdk@1,0
Specify disk (enter its number): 0
selecting c3t0d0


As you can see, '0' is my hardware RAID, and 1-6 are my JBOD.
Then we type 'partition' (or just 'part') to enter the partition menu, and 'print' to print the partition table:
Current partition table (original):
Total disk cylinders available: 9723 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 1 - 8480 64.96GB (8480/0/0) 136231200
1 unassigned wm 0 0 (0/0/0) 0
2 backup wu 0 - 8481 64.98GB (8482/0/0) 136263330
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 unassigned wm 0 0 (0/0/0) 0
7 unassigned wm 0 0 (0/0/0) 0
8 boot wu 0 - 0 7.84MB (1/0/0) 16065
9 unassigned wm 0 0 (0/0/0) 0


Now, at this point what I've done is a full install of OpenSolaris onto a 65GB Solaris partition. You have the 'root' slice, which takes up almost 100% of the Solaris partition, the 'backup' slice, which spans 100% of it, and the tiny 'boot' slice. That leaves roughly 10GB of the 80GB drive to 'expand' into. Here's what I did.
First, save the slice info somewhere, as you'll need it again.
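The easy way to save it is to dump the VTOC to a file and copy that somewhere off the box (my own habit, not something from the original blog entry):

# prtvtoc /dev/rdsk/c3t0d0s2 > /root/c3t0d0.vtoc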

1. Boot off your OpenSolaris install disk
2. Open a terminal
3. #pfexec format (pfexec lets you run format as root)
4. select your drive, type fdisk:
format> fdisk
Total disk size is 9726 cylinders
Cylinder size is 16065 (512 byte) blocks

Cylinders
Partition Status Type Start End Length %
====== ===== ======== ==== ==== ==== ===
1 Active Solaris2 1 8482 8482 85


(I've reentered these numbers from memory, so they may not be exact)

Now, we delete that partition, and create a new one:

Cylinders
Partition Status Type Start End Length %
========= ====== ============ ===== === ====== ===
1 Active Solaris2 1 9725 9725 100


Now that we've re-created the fdisk partition, the slices will be messed up.
Quit out of fdisk back to format, and go into the partition menu again.
You'll need to re-create the root slice based on your saved data (or see the fmthard sketch just below the table).
Then add a slice 7 for your metadb - when you're done, it should look similar to this:

Current partition table (original):
Total disk cylinders available: 9723 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 1 - 8480 64.96GB (8480/0/0) 136231200
1 unassigned wm 0 0 (0/0/0) 0
2 backup wu 0 - 9722 74.48GB (9723/0/0) 156199995
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 unassigned wm 0 0 (0/0/0) 0
7 unassigned wm 8481 - 9722 9.51GB (1242/0/0) 19952730
8 boot wu 0 - 0 7.84MB (1/0/0) 16065
9 unassigned wm 0 0 (0/0/0) 0


I stopped one cylinder short of the last, just to be safe.
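If you'd rather not retype the root slice through the partition menu, fmthard can write it back from the numbers you saved earlier. This is a sketch based on my table above (slice 0, tag root, starting at sector 16065 = cylinder 1, 136231200 blocks); check it against your own saved VTOC before running it:

# fmthard -d 0:2:00:16065:136231200 /dev/rdsk/c3t0d0s2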

Because of the fdisk fix, we can FINALLY create our metadb on the new slice!
# metadb -a -f -c 4 /dev/rdsk/c3t0d0s7
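A quick sanity check (not in the original blog entry) is to list the replicas you just created:

# metadb -i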

----------------- Install and Configure AVS ---------------------

Now we can reboot the system, log in, and continue with the AVS install.
In the Package Manager, do an 'Update All'.
After you've rebooted, go back into the Package Manager and search for 'avail'. The descriptions should now be there, and all of the Availability Suite packages will show up.

Run dscfgadm and answer 'y' when it asks to start the AVS services:
# dscfgadm
Would you like to start the services now? [y,n,?] y
If you get something like:
svcadm: Instance "svc:/system/nws_scm:default" is in maintenance state.
nws_scm failed to enable


Then:
#svcadm enable nws_scm
and re-run dscfgadm.

#dscfgadm -i to verify the services are all running - enable them as necessary.

Now, as we saw from the format command, my JBOD disks to be replicated are:
c4d0,c4d1,c5d0,c5d1,c7d0,c7d1

Most of what follows is replicated from the linked blog entry above, corrected for my disk set.
I had to create and delete a zpool on each disk to 'initialize' it (i.e., put an EFI label on it):
zpool create -f temp c4d0; zpool destroy temp
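I did that once per drive; a small loop does the same thing (just the command above wrapped around my six drive names):

#for d in c4d0 c4d1 c5d0 c5d1 c7d0 c7d1; do zpool create -f temp $d; zpool destroy temp; done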

On one drive run:
#dsbitmap -r /dev/rdsk/c4d0s0 | tee /tmp/vol_size

#VOL_SIZE="`cat /tmp/vol_size | grep 'size: [0-9]' | awk '{print $5}'`"
#BMP_SIZE="`cat /tmp/vol_size | grep 'Sync ' | awk '{print $3}'`"
#SVM_SIZE=$(((((BMP_SIZE + 16 - 1) / 16) * 16) * 2))
#ZFS_SIZE=$((VOL_SIZE - SVM_SIZE))
#SVM_OFFS=$((34 + ZFS_SIZE))
#echo "Original volume size: $VOL_SIZE, Bitmap size: $BMP_SIZE"
#echo "SVM soft partition size: $SVM_SIZE, ZFS vdev size: $ZFS_SIZE"

What he's doing here is working out the numbers needed to hard-format these disks for SVM use: one slice for the ZFS vdev, and a second slice for SVM sized to hold the SNDR bitmap, rounded up to a 16-block boundary and then doubled (as far as I can tell, the doubling is because each half of the bitmap mirror lives on only three disks but has to hold the bitmaps for all six replicas).
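To sanity-check the math with my own numbers (the 3727-block bitmap size shows up again in the metastat output at the end of this post):

BMP_SIZE = 3727 blocks (from dsbitmap)
rounded up to a multiple of 16 -> 3728
SVM_SIZE = 3728 * 2 = 7456 blocks per disk
ZFS_SIZE = VOL_SIZE - 7456
SVM_OFFS = 34 + ZFS_SIZE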

Now we create a find command or two to make things easier:
This is mine: find /dev/rdsk/c[457]d[01]s0

Now format the drives:
#find /dev/rdsk/c[457]d[01]s0 | xargs -n1 fmthard -d 0:4:0:34:$ZFS_SIZE
#find /dev/rdsk/c[457]d[01]s0 | xargs -n1 fmthard -d 1:4:0:$SVM_OFFS:$SVM_SIZE
#find /dev/rdsk/c[457]d[01]s0 | xargs -n1 prtvtoc |egrep "^ [01]|partition map"

If you missed 'initializing' one of the drives with ZFS, its output won't look right.

Change the find command from above, with the additional selection of only even numbered disks, placing slice 1 of all selected disks into the SVM metadevice d101

# find /dev/rdsk/c[457]d0s1 | xargs -I {} echo 1 {} | xargs metainit d101 `find /dev/rdsk/c[457]d0s1 | wc -l`

Re-use the corrected find command from above, with the additional selection of only odd numbered disks, placing slice 1 of all selected disks into the SVM metadevice d102

# find /dev/rdsk/c[457]d1s1 | xargs -I {} echo 1 {} | xargs metainit d102 `find /dev/rdsk/c[457]d1s1 | wc -l`
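For my six disks, those two pipelines should expand to something like this (which matches the metastat output at the end of the post):

# metainit d101 3 1 /dev/rdsk/c4d0s1 1 /dev/rdsk/c5d0s1 1 /dev/rdsk/c7d0s1
# metainit d102 3 1 /dev/rdsk/c4d1s1 1 /dev/rdsk/c5d1s1 1 /dev/rdsk/c7d1s1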

Now mirror metadevices d101 and d102 into mirror d100, ignoring the WARNING that both sides of the mirror will not be the same. When the bitmap volumes are created, they will be initialized, at which time both sides of the mirror will be equal.

#metainit d100 -m d101 d102

Now, from the mirrored SVM storage pool, allocate bitmap volumes out of SVM soft partitions for each SNDR replica:

#OFFSET=1
#for n in `find /dev/rdsk/c[457]d[01]s1 | grep -n s1 | cut -d ':' -f1 | xargs`
do
metainit d$n -p /dev/md/rdsk/d100 -o $OFFSET -b $BMP_SIZE
OFFSET=$(((OFFSET + BMP_SIZE + 1)))
done

I just cut and pasted the above (dropping the leading # prompt) to make sure it would all work.

Generate the SNDR enable on NODE-A

#DISK=1
#for ZFS_DISK in `find /dev/rdsk/c[457]d[01]s0`
do
sndradm -nE sysvolone $ZFS_DISK /dev/md/rdsk/d$DISK sysvoltwo $ZFS_DISK /dev/md/rdsk/d$DISK ip sync g zfs-pool
DISK=$(((DISK + 1)))
done

Generate the SNDR enable on NODE-B

#DISK=1
#for ZFS_DISK in `find /dev/rdsk/c[457]d[01]s0`
do
sndradm -nE sysvoltwo $ZFS_DISK /dev/md/rdsk/d$DISK sysvolone $ZFS_DISK /dev/md/rdsk/d$DISK ip sync g zfs-pool
DISK=$(((DISK + 1)))
done
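For reference, the first pass through each of those loops expands to something like this on NODE-A (NODE-B is the same with the hostnames swapped):

#sndradm -nE sysvolone /dev/rdsk/c4d0s0 /dev/md/rdsk/d1 sysvoltwo /dev/rdsk/c4d0s0 /dev/md/rdsk/d1 ip sync g zfs-pool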

Now perform the zpool creation on each box:

The original blog entry uses: find /dev/rdsk/c[457]d[01]s0 | xargs zpool create zfs-pool
On OpenSolaris, I needed awk to strip the path and hand zpool just the device names:
#find /dev/rdsk/c[457]d[01]s0 | awk -F / '{print $4}' | xargs zpool create zfs-pool

Actually, I used a raidz, because I'm not quite sure yet what's happening on each system with the metadevices. I thought I would end up with two pools of 3 disks, but I still get 6 disks' worth of space in a plain zpool. (In hindsight that looks expected: the SVM mirror only covers the little bitmap slices on s1, while the pool is built on the s0 slices of all six disks, each of which SNDR replicates to the other node.)
#find /dev/rdsk/c[457]d[01]s0 | awk -F / '{print $4}' | xargs zpool create zfs-pool raidz

Enable replication:
#sndradm -g zfs-pool -nu
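Note that -u requests an update (re)sync; for the very first copy a full synchronization is probably what you want - this is my reading of sndradm, not something I got far enough to verify:
#sndradm -g zfs-pool -nm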

View changes:
#sndradm -g zfs-pool -P
output is:
/dev/rdsk/c4d0s0 -> sysvoltwo:/dev/rdsk/c4d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging
/dev/rdsk/c4d1s0 -> sysvoltwo:/dev/rdsk/c4d1s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging
/dev/rdsk/c5d0s0 -> sysvoltwo:/dev/rdsk/c5d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging
/dev/rdsk/c5d1s0 -> sysvoltwo:/dev/rdsk/c5d1s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging
/dev/rdsk/c7d0s0 -> sysvoltwo:/dev/rdsk/c7d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging
/dev/rdsk/c7d1s0 -> sysvoltwo:/dev/rdsk/c7d1s0
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: zfs-pool, state: logging

#metastat -p
output is:
d6 -p /dev/md/rdsk/d100 -o 18641 -b 3727
d100 -m /dev/md/rdsk/d101 /dev/md/rdsk/d102 1
d101 3 1 /dev/rdsk/c4d0s1 \
1 /dev/rdsk/c5d0s1 \
1 /dev/rdsk/c7d0s1
d102 3 1 /dev/rdsk/c4d1s1 \
1 /dev/rdsk/c5d1s1 \
1 /dev/rdsk/c7d1s1
d5 -p /dev/md/rdsk/d100 -o 14913 -b 3727
d4 -p /dev/md/rdsk/d100 -o 11185 -b 3727
d3 -p /dev/md/rdsk/d100 -o 7457 -b 3727
d2 -p /dev/md/rdsk/d100 -o 3729 -b 3727
d1 -p /dev/md/rdsk/d100 -o 1 -b 3727

Thursday, January 8, 2009

OpenSolaris 2008.11 - Notes for Qmail and Vpopmail

I was very excited to try out OpenSolaris 2008.11. You see, I had a need for active/active replication on the cheap. I also have a newfound love of ZFS. ZFS works wonderfully on FreeBSD 7.0, but I'm having bi-monthly issues with that system, and I'm attributing them to the beta state of ZFS there. Yes, it's 64-bit and has 8GB of RAM, and I have tweaked some settings, but I just have to have something more stable.

Enter OpenSolaris: a free OS with the stability of Solaris, mature ZFS code, PLUS the recently open-sourced Availability Suite, and I have an excellent solution that fits my problem. At least on paper. The plan is to have two servers, each with a single hardware-RAID-mirrored OS drive and six 500GB JBOD SATA drives. Using AVS, I will mirror every other drive to the opposite system, as shown here:
http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Back to my excitement about OpenSolaris 2008.11. Why was I excited? Because 2008.05 didn't support the Areca OR LSI RAID cards I purchased. That really sucked. These things aren't cheap, and I'm trying to do this on the cheap. Fortunately, just as I was about to ditch the whole idea, 2008.11 came out with support for Areca.

Installation. As opposed to Solaris proper (which I also tried, and which didn't support my RAID cards either), OpenSolaris' installation program is quite nice. It's very straightforward and simple. I booted off the CD into the GUI environment, clicked 'Install', and had the OS on my hardware-mirrored drives fairly rapidly.

Configuration. Here's where the trouble started - at least, after the driver debacle. I tried to change the IP address. Sounds simple, doesn't it? Hell no. I'm ssh'd into these machines, and supposedly you can just change a couple of files in /etc. That doesn't work. I also tried the 'sys-unconfig' command to start over from scratch. It cleared the system, but I discovered a new issue: the re-configuration doesn't come up in GUI mode. You MUST select TEXT mode from GRUB when booting. Amusingly, this is the only time TEXT mode actually works. If you select TEXT mode when the system is fully operational, it will happily boot into the GUI after giving you something like five lines of output. Nice.
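For what it's worth, the manual method I was attempting boils down to disabling NWAM and falling back to the classic Solaris files. This is just a sketch (the e1000g0 interface name and the addresses are examples, not my real config), and NWAM clearly wasn't cooperating on my boxes:

svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default
echo "192.168.1.10" > /etc/hostname.e1000g0
echo "192.168.1.0 255.255.255.0" >> /etc/netmasks
echo "192.168.1.1" > /etc/defaultrouter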

The GUI posed a huge problem: I only have a single PS/2 port on these servers, and no USB mouse. The GUI environment (GNOME) ABSOLUTELY SUCKS without a mouse. You cannot move around at all. I had to resort to banging on the keyboard to find a combination that would get me out of certain menus. It's awful. Eventually I found a USB mouse and was able to quickly configure the network on the 2nd machine via the GUI. Even then it only became active after a reboot; it's not a 'live' change.

Oh yeah, and edit /etc/nsswitch.conf
Make sure:
hosts: files dns mdns
ipnodes: files dns mdns

This doesn't change automatically for some reason.

I'd also like to note that I can't log into the GUI as root, but oddly enough, as 'rick' I can change the network settings and install packages. That is not good.

Now I needed to install my software. At that point I didn't have a USB mouse, so I was installing via SSH. Bad idea. 'pkg search -r name' is a good way to get a whole list of packages, but the names are all funky. MySQL is SUNWmysql5. Why can't it just be 'mysql5'? And the service isn't enabled after install, so we need:

svcadm enable mysql5
pfexec svccfg import /var/svc/manifest/application/database/mysql.xml
svcs -xv mysql

Now I realize I need packages that the Sun repository (ha!) doesn't have, so we need to add a couple of authorities:

pkg set-authority -O http://blastwave.network.com:10000 blastwave
pkg set-authority -O http://pkg.sunfreeware.com:9000 sunfreeware
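You can confirm they were added by listing the configured authorities (in the 2008.11 tooling this was 'pkg authority'; later releases renamed it 'pkg publisher'):

pkg authority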

Then add the packages:

pkg refresh

pkg install SUNWgcc
pkg install SUNWcurl
pkg install IPSpkgconfig
pkg install IPSgawk
pkg install IPSFWlynx

pkg install perl-dbi

OK, so after a lot of searching I have some base applications installed, but I need more. I need Perl modules. Oh, but you can't just compile Perl modules without a hack - that would be silly!

vi /usr/perl5/5.8.4/lib/i86pc-solaris-64int/Config.pm

Find these variables, and set them as follows:
cccdlflags='-fPIC'
optimize='-O3'

Oh, and since the bundled Perl was built with Sun's compiler, point 'cc' at gcc:
ln -s /usr/bin/gcc /usr/bin/cc

Now you can use cpan to install Perl modules.
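After that, cpan works in the usual way. For example (DBD::mysql here is just an illustration, not something from my notes; it also wants the MySQL headers and libraries from SUNWmysql5):

perl -MCPAN -e 'install DBD::mysql'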

Note: when using useradd, you MUST SUPPLY A VALID SHELL! !#@#!@!
That means either passing -s /bin/bash when you add your daemon users (as I do below) or fixing the shell in /etc/passwd afterwards.
Maybe there's a better way, I dunno.

Installing Qmail:
groupadd nofiles
useradd -g nofiles -d /var/qmail -s /bin/bash qmaild
useradd -g nofiles -d /var/qmail -s /bin/bash alias
useradd -g nofiles -d /var/qmail -s /bin/bash qmaill
useradd -g nofiles -d /var/qmail -s /bin/bash qmailp
groupadd qmail
useradd -g qmail -d /var/qmail -s /bin/bash qmailq
useradd -g qmail -d /var/qmail -s /bin/bash qmailr
useradd -g qmail -d /var/qmail -s /bin/bash qmails
mkdir /var/qmail
wget http://www.qmail.org/netqmail-1.06.tar.gz
tar xzf netqmail-1.06.tar.gz
cd netqmail-1.06
make setup check

Installing vpopmail:
wget http://voxel.dl.sourceforge.net/sourceforge/vpopmail/vpopmail-5.4.17.tar.gz
tar xzf vpopmail-5.4.17.tar.gz
cd vpopmail-5.4.17
groupadd -g 89 vchkpw
useradd -u 89 -g vchkpw -d /usr/local/vpopmail -s /bin/bash vpopmail

I had quota issues on my system: anything over 2GB was reported incorrectly. The fix is to change off_t to int64_t in maildirquota.c. Why there were no problems on other 64-bit kernels, or even 32-bit ones, is beyond me - other than it being an OpenSolaris issue :/
Configure can be run as you wish, but I had to point it at the MySQL include and lib dirs, and then link the MySQL library into a path where vpopmail could find it:

./configure --enable-auth-module=mysql --enable-rebuild-tcpserver-file=n --enable-ip-alias-domains=y --enable-valias=y --enable-qmail-ext=y --enable-mysql-replication=y --enable-incdir=/usr/mysql/include/mysql --enable-libdir=/usr/mysql/lib/mysql/

ln -s /usr/mysql/lib/mysql/libmysqlclient.so.15 /usr/lib/libmysqlclient.so.15
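From there the build is presumably the standard vpopmail routine (my assumption - I didn't note anything unusual):

make
make install-strip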

DOVECOT:
http://dovecot.org/releases/1.1/dovecot-1.1.7.tar.gz
./configure
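That configure gets run from inside the extracted dovecot-1.1.7 directory. I didn't write down anything past it, so I assume the rest was just the stock routine (use gmake if the default make complains):

make
make install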

YAY! No issues on a single running system. I've hosed up my second system while trying to learn AVS, so after I fight through the IP change again and figure out AVS, I'll follow up with another post.