EDIT: READ THIS BEFORE TRYING...
I think my last post was rather negative. I was getting discouraged with setting up a redundant failover SAN cluster because I found that DRBD was just too flaky in the setup I wanted. The problem, though, was that almost all of the home-grown cluster solution tutorials on the web use DRBD with either Heartbeat or Corosync. It is almost impossible to find a different, or even better, solution... until now, of course.
I can oftentimes be a bloodhound when something isn't working quite right and I really want it to. I will forego sleep to work on projects, even in the lab, to make things work right. I just don't like computers telling me I can't do something. I TELL YOU WHAT TO DO COMPUTERS! NOT THE OTHER WAY AROUND!
So since DRBD doesn't work worth a good God damn, I decided to look elsewhere. Enter GlusterFS! Never heard of it? Here is what Wikipedia says:
GlusterFS is a scale-out NAS file system developed by Gluster. It aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance. It has found a variety of applications including cloud computing, biomedical sciences and archival storage. GlusterFS is free software, licensed under GNU AGPL v3 license.
I like it way better than DRBD because it doesn't require meta-data to sync before it will work. It just works! In fact, once it is set up it writes like a true RAID 1. If you put a file in your sync'd directory, it will automagically show up on the other node in the cluster in real time!
Ok, so I figured out how to make clustering work with GlusterFS and Heartbeat. What's this about deduplication and thin provisioning? Yes! I got that working as well. In fact, not only can we do deduplication, we can do compression if we want. How? It's all thanks to the miracle that is ZFS! What's ZFS? According to Wikipedia:
In computing, ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include data integrity verification against data corruption modes (like bit rot), support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs. ZFS is implemented as open-source software, licensed under the Common Development and Distribution License (CDDL). The ZFS name is a trademark of Oracle.
Although it does deduplication, I am not enabling it in my setup yet because of the high RAM requirements (you can get around those somewhat by throwing in a large SSD for caching). Wikipedia says this about dedup with ZFS:
Effective use of deduplication requires additional hardware. ZFS designers recommend 2 GiB of RAM for every 1 TiB of storage. Example: at least 32 GiB of memory is recommended for 20 TiB of storage. If RAM is lacking, consider adding an SSD as a cache, which will automatically handle the large de-dupe tables. This can speed up de-dupe performance 8x or more. Insufficient physical memory or lack of ZFS cache results in virtual memory thrashing, which lowers performance significantly.
If you have the hardware already though, I'll show you how to enable dedup and/or compression later in this post.
Let's get started then. Here is how my hardware is set up. Make changes as necessary to fit your hardware.
Server1: SuperMicro SC826TQ with 4 NICs, 2 quad core CPUs and 4GB of RAM, 3Ware 9750-4i RAID Controller, and twelve 2TB SATA Drives
Server2: SuperMicro SC826TQ with 4 NICs, 2 quad core CPUs and 4GB of RAM, 3Ware 9750-4i RAID Controller, and twelve 2TB SATA Drives
In the 3Ware management console, I configured a 10GB OS unit and put everything in RAID 5. I then partitioned my disks as follows on both servers:
Device | Mount Point | Format | Size |
/dev/sda1 | / | ext4 | 9GB |
/dev/sda2 | None | swap | 1GB |
/dev/sdb | None | raw | 19TB |
Also on both servers I have the NICs configured as follows:
Interface | Purpose | IP Address Server1 | IP Address Server2 | Network |
bond0 | Team | 100.100.10.12 | 100.100.10.13 | iSCSI |
bond1 | Team | 172.16.18.12 | 172.16.18.13 | Heartbeat |
eth0 | Slave to bond0 | N/A | N/A | iSCSI |
eth1 | Slave to bond1 | N/A | N/A | Heartbeat |
eth2 | Slave to bond0 | N/A | N/A | iSCSI |
eth3 | Slave to bond1 | N/A | N/A | Heartbeat |
I also have a virtual IP address of 100.100.10.14 that will be controlled by Heartbeat. The bond1 connection will connect one server to the other using crossover cables.
We are leaving /dev/sdb raw for now so we can use it for ZFS later. If you want to skip the whole ZFS thing, you can just partition /dev/sdb with ext4 as well. You just won’t get dedup or compression.
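If you go the plain ext4 route instead, here is a rough sketch of what that would look like (the single /dev/sdb1 partition and the /data mount point are just my assumptions; adjust to your own layout). Also keep in mind that the ext4 tools from this era top out around 16TiB per filesystem, so you may have to split a 19TB unit in two:
#sudo parted -s /dev/sdb mklabel gpt
#sudo parted -s /dev/sdb mkpart primary ext4 0% 100%
#sudo mkfs.ext4 /dev/sdb1
#sudo mkdir /data
#sudo mount /dev/sdb1 /data
Then add a line like "/dev/sdb1 /data ext4 defaults 0 2" to /etc/fstab so it comes back at boot. Mounting it at /data keeps it consistent with where the GlusterFS brick lives later in this post.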
I also installed Ubuntu 10.10 on these servers because that is the latest version that the 3Ware 9750-4i supports. If you have the same card you can download the driver here: (Ubuntu 3Ware driver)
Once you get Ubuntu installed and partitioned the way you want, we need some packages. One of the packages comes from a special repository. If you have Ubuntu 10.10, run the following to get the add-apt-repository command. Please note that unless I specify otherwise, all commands need to be run on both servers:
#sudo apt-get install python-software-properties
Now add the ZFS repository:
#sudo add-apt-repository ppa:dajhorn/zfs
Now let's update apt:
#sudo apt-get update
Next we install all of our necessary packages:
#sudo apt-get install -y snmpd ifenslave iscsitarget glusterfs-server glusterfs-client heartbeat ubuntu-zfs sysstat
If you only have two NICs, you don't need ifenslave. You only need that to team your NICs. I will assume you have 4 NICs like me for the purpose of this post though. I also added snmpd so I could monitor my SANs with Zenoss, and sysstat so I can check I/O performance using the iostat command.
After that, let's configure our NIC teams. Edit /etc/modprobe.d/aliases.conf with your favorite text editor. I like nano, for example:
#sudo nano /etc/modprobe.d/aliases.conf
Add the following:
alias bond0 bonding
options bond0 mode=0 miimon=100 downdelay=200 updelay=200 max_bonds=2
alias bond1 bonding
options bond1 mode=0 miimon=100 downdelay=200 updelay=200
Now edit /etc/network/interfaces and replace the contents with the following:
# The loopback network interface
auto lo
iface lo inet loopback
# The interfaces that will be bonded
auto eth0
iface eth0 inet manual
auto eth1
iface eth1 inet manual
auto eth2
iface eth2 inet manual
auto eth3
iface eth3 inet manual
# The target-accessible network interface
auto bond0
iface bond0 inet static
# use 100.100.10.13 on Server2
address 100.100.10.12
netmask 255.255.255.0
broadcast 100.100.10.255
network 100.100.10.0
mtu 9000
up /sbin/ifenslave bond0 eth0
up /sbin/ifenslave bond0 eth2
# The isolated network interface
auto bond1
iface bond1 inet static
# use 172.16.18.13 on Server2
address 172.16.18.12
netmask 255.255.255.0
broadcast 172.16.18.255
network 172.16.18.0
mtu 9000
up /sbin/ifenslave bond1 eth1
up /sbin/ifenslave bond1 eth3
If your network is not configured for jumbo frames, remove the mtu 9000 option. Also, make sure to change the IP information to both match your environment and your hosts. See the table above for IP assignments.
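Once the bonds come up (after the reboot in the next step), you can sanity-check that each bond actually grabbed both of its slave NICs:
#cat /proc/net/bonding/bond0
#cat /proc/net/bonding/bond1
You should see both slave interfaces listed with their MII status up under each bond.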
At this point since my iSCSI network has no Internet access I rebooted both servers and plugged them into the iSCSI switch.
When both hosts come back up, edit /etc/hosts and add the following on both servers:
172.16.18.12 Server1
172.16.18.13 Server2
This is so the servers can communicate by name over the Heartbeat connection. Once that is ready, run the following on each server. The example below is written from Server1's point of view; when you run it on Server2, copy the key to root@Server1 instead:
#sudo mkdir -p /root/.ssh
#sudo touch /root/.ssh/authorized_keys
#sudo ssh-keygen -t dsa (Just press enter for everything)
#sudo scp /root/.ssh/id_dsa.pub root@Server2:/root/.ssh/authorized_keys
This will allow you to copy files back and forth without having to enter a password each time.
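To make sure the key exchange actually took, try something like this from Server1 (and the reverse from Server2); it should print the other server's hostname without asking for a password:
#sudo ssh root@Server2 hostname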
Now we configure our ZFS storage. If you check the table above, you'll see that I am putting it on /dev/sdb. To do that, run the following:
#sudo zpool create data /dev/sdb
This mounts to /data automatically. According to Ubuntu-ZFS documentation, ZFS remounts it automatically at reboot. I found that to not be the case. Since it didn’t mount correctly, it made my GlusterFS setup fail after rebooting. To fix that I created a startup script in /etc/init.d called zfsmount with the following:
#!/bin/sh
zfs mount data
glusterfs -f /etc/glusterfs/glusterfs.vol /iscsi
/etc/init.d/glusterfs-server start
I made the script executable by running:
#sudo chmod +x /etc/init.d/zfsmount
I then copied that file over to the other server:
#sudo scp /etc/init.d/zfsmount root@Server2:/etc/init.d/
What that does is mount the ZFS volume to /data, then mount the GlusterFS client volume to /iscsi (we'll get there), then start the glusterfs-server daemon at boot. Because we want the zfsmount script to start the GlusterFS service, I also had to remove GlusterFS from rc.d by running the following:
#sudo update-rc.d -f glusterfs-server remove
We also want to make zfsmount run at boot, so we will add it to rc.d:
#sudo update-rc.d zfsmount defaults
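Before moving on, it doesn't hurt to confirm the pool is healthy and mounted where we expect:
#sudo zpool status data
#sudo zfs list data
#df -h /data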
At this point you can enable dedup or compression if you have the right hardware for it by running the following:
#sudo zfs set dedup=on data
or
#sudo zfs set compression=on data
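If you do turn either of those on, ZFS will tell you whether it is actually earning its keep. These are stock ZFS properties; data is just the name of the pool we created above:
#sudo zpool get dedupratio data
#sudo zfs get compressratio data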
Now that our storage is ready, let's configure GlusterFS! Edit /etc/glusterfs/glusterfsd.vol, clear the contents, and add this:
volume posix
  type storage/posix
  option directory /data
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  # use 172.16.18.13 on Server2
  option bind-address 172.16.18.12
  option auth.addr.brick.allow 172.16.18.*
  subvolumes brick
end-volume
Copy /etc/glusterfs/glusterfsd.vol to Server2 (and remember to change the bind-address on Server2 to 172.16.18.13 afterwards):
#sudo scp /etc/glusterfs/glusterfsd.vol root@Server2:/etc/glusterfs/
Now start the GlusterFS Server service on both servers by running:
#sudo service glusterfs-server start
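To make sure the daemon really came up, a quick check on each server should show the gluster server process listening on the bind-address we set in the vol file:
#sudo netstat -tlnp | grep gluster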
Now let's make our GlusterFS client directory by running the following:
#sudo mkdir /iscsi
Now let's edit /etc/glusterfs/glusterfs.vol, clear everything, and add:
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 172.16.18.12
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 172.16.18.13
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes remote1 remote2
end-volume

volume writebehind
  type performance/write-behind
  option window-size 1MB
  subvolumes replicate
end-volume

volume cache
  type performance/io-cache
  option cache-size 512MB
  subvolumes writebehind
end-volume
Now you can mount the GlusterFS client volume to /iscsi by running the following:
#sudo glusterfs -f /etc/glusterfs/glusterfs.vol /iscsi
This will happen automatically at reboot thanks to our handy zfsmount script. Next we need to make two directories in /iscsi: one for our iscsitarget configs, and the other for our LUNs. Since GlusterFS is now replicating, we only need to do this on Server1.
#sudo mkdir /iscsi/iet
#sudo mkdir /iscsi/storage
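Here is a quick sanity check of the replication: we only ran those mkdir commands on Server1, so hop over to Server2 and run:
#ls /iscsi
You should see both iet and storage listed there almost instantly. If you want to be extra sure, create a throwaway file on Server1:
#sudo touch /iscsi/storage/test
Then confirm it shows up under /iscsi/storage on Server2 before deleting it from either side.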
Now let's move our iscsitarget configs to /iscsi/iet:
#sudo mv /etc/iet/* /iscsi/iet/
Now we will create links to those files:
#sudo ln -s /iscsi/iet/* /etc/iet/
On Server2 run the following:
#sudo rm /etc/iet/*
#sudo ln -s /iscsi/iet/* /etc/iet/
Now our iscsitarget configs only need to be changed in one spot, and the change is automatically replicated to the other node. Now it's time to configure Heartbeat, which will manage iscsitarget as well as our virtual IP address.
On Server1 you will need to edit three files in /etc/heartbeat. File one is ha.cf; edit it as follows:
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
bcast bond0
bcast bond1
node Server1
node Server2
auto_failback no
Next edit authkeys with the following:
auth 2
2 crc
Now set permissions on authkeys by running:
#sudo chmod 600 /etc/heartbeat/authkeys
Finally we edit haresources and add the following (Server1 is the name of the preferred node, and needs to match one of the node names from ha.cf):
Server1 IPaddr::100.100.10.14/24/bond0 iscsitarget
Now copy those files over to Server2:
#sudo scp /etc/heartbeat/ha.cf root@Server2:/etc/heartbeat
#sudo scp /etc/heartbeat/authkeys root@Server2:/etc/heartbeat
#sudo scp /etc/heartbeat/haresources root@Server2:/etc/heartbeat
Finally we are ready to configure some storage. I will show you how to create a LUN using either thin or thick provisioning. We will do this with the dd command.
Change into your /iscsi/storage directory. To create a 1TB thin provisioned LUN called LUN0 you would run the following:
#sudo dd if=/dev/zero of=LUN0 bs=1 count=0 seek=1T
To create the same LUN, but thick provisioned, run:
#sudo dd if=/dev/zero of=LUN0 bs=1M count=1048576
The thick version actually writes out the full terabyte of zeros, so expect it to take a while. Thin provisioning allows us to overcommit our storage, but thick provisioning is easier to maintain: you don't have to worry about running out of space later, because once you have everything provisioned, that's all you get!
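You can actually see the difference between the two on disk: ls shows the apparent size of the file, while du shows the blocks that have really been allocated:
#ls -lh /iscsi/storage/LUN0
#du -h /iscsi/storage/LUN0
A freshly thin provisioned LUN will show 1.0T from ls but next to nothing from du; a thick provisioned LUN shows roughly the same number from both.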
Okay, so now we have our first LUN called LUN0 in /iscsi/storage. Now we need to configure iscsitarget to serve up that LUN. First, since we want Heartbeat to manage iscsitarget, let's remove iscsitarget from rc.d by running the following on both servers:
#sudo update-rc.d -f iscsitarget remove
Now, on Server1 only, edit /iscsi/iet/ietd.conf with the following:
Target iqn.2011-08.BAUER-POWER:iscsi.LUN0
    Lun 0 Path=/iscsi/storage/LUN0,Type=fileio,ScsiSN=random-0001
    Alias LUN0
If you want to add CHAP authentication, you can do that in ietd.conf, but I'll let you Google that yourself. I prefer to lock down my LUNs to either IP addresses or iSCSI initiators. In iscsitarget, it's easier (for me) to filter by IP. To do that, add the following line to /iscsi/iet/initiators.allow (the target name has to match the one in ietd.conf):
iqn.2011-08.BAUER-POWER:iscsi.LUN0 100.100.10.148
The above example restricts access to LUN0 to only 100.100.10.148. If you want multiple hosts to access a LUN (like with VMware), you can add more IPs separated by commas.
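Just to show the pattern, here is what adding a hypothetical second LUN would look like (LUN1, the 500GB size, the ScsiSN and the initiator IP are all made-up examples). Create the file on Server1:
#sudo dd if=/dev/zero of=/iscsi/storage/LUN1 bs=1 count=0 seek=500G
Add another target stanza to /iscsi/iet/ietd.conf:
Target iqn.2011-08.BAUER-POWER:iscsi.LUN1
    Lun 0 Path=/iscsi/storage/LUN1,Type=fileio,ScsiSN=random-0002
    Alias LUN1
And a matching line in /iscsi/iet/initiators.allow:
iqn.2011-08.BAUER-POWER:iscsi.LUN1 100.100.10.148
Since those files live on the replicated GlusterFS volume, you only have to make the change once, but you will need to restart iscsitarget on the active node (or wait for the next failover) before the new target shows up.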
Now we’re ready to rock and roll. Reboot both nodes, and start pinging your virtual IP address. When that comes up, try connecting to the virtual IP address using your iSCSI initiator.
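Once you are connected, it is worth testing an actual failover before trusting the cluster with production data. Assuming your heartbeat package installed the usual helper scripts in /usr/share/heartbeat, you can push the resources off the active node with:
#sudo /usr/share/heartbeat/hb_standby
Then on the other node, confirm it picked up the virtual IP:
#ip addr show bond0
You should see 100.100.10.14 listed on bond0 on whichever node is active, and your initiator should reconnect after a short pause. Pulling the power on the active node is the more brutal version of the same test.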
It took a long time to write this up, but it's actually fairly easy to get going. This solution gave my company 20TB of cheap redundant SATA storage at half the cost of an 8TB NetApp!
If you have any questions, or are confused by my directions, hit me up in the comments.