About a week ago my company did our second, and hopefully last, data center move in a year. Our first move happened because we were using a managed colocation (colo) company to host our equipment. What is that, you ask? Basically it’s a middleman that rents out a cage at a local colocation facility, then sub-rents you a cabinet to put your stuff in. They provide you with an Internet connection, and in our case they also managed our firewalls. Anyway, they wanted to move us from one data center to another, so we went ahead and did that.
A month or two after the move we started having strange network hiccups where one of our VLANs would stop routing traffic for a few minutes, then come back up. It caused our production web servers to drop their connections to the database, which took our sites down. I tried working with our managed colo provider to troubleshoot, doing stuff like asking for the firewall logs so I could see whether the problem was with the firewalls, the switches, or both. Working with them was kind of like asking your neighbor to kick you in the balls. It wasn’t very fun. After a while we decided to kick them to the curb and go direct with American Internet Services. With that move we were also going to save about $18,000 per year and a heap of headaches.
Or so I thought. Like I said, about a week ago we moved away from our previous company and into a new cabinet just down the hall in the same data center. The setup in our old cabinet was an active/passive Sonicwall NSA 2400 failover cluster. We had two redundant internet connections that came out of a little biscuit in our cabinet directly into the X1 ports of the primary and secondary firewall appliances. Not really thinking about it, I assumed that when I plugged those ports into the biscuits in my new cabinet, everything would work great. No, sorry folks.
You see, our new colo provider uses HSRP to create a redundant default gateway. If you don’t know what HSRP is, Wikipedia describes it like this:
Hot Standby Router Protocol (HSRP) is a Cisco proprietary redundancy protocol for establishing a fault-tolerant default gateway, and has been described in detail in RFC 2281.
The protocol establishes a framework between network routers in order to achieve default gateway failover if the primary gateway becomes inaccessible, in close association with a rapid-converging routing protocol like EIGRP or OSPF.
When we plugged our HSRP connections directly into the firewalls, our internal network worked fine, but for some reason our DMZ in transparent mode could not ping out to the internet. I shut down the switch port connecting the DMZ VLAN to the firewall and brought it back up, and then the servers could ping out to the internet. The problem we saw after that, though, was that connectivity was intermittent. Some servers could talk to the internet, and others could not. We also found that when we ran traceroutes to our public websites, some would make it and others would get dropped by our firewall. We finally unplugged one HSRP connection just to get everything working for the time being until we could figure out what the hell was going on.
Well, it turns out that I’m not the only one who has had an issue with Sonicwall and HSRP. Like most IT guys faced with an issue, I turned to Google. Google didn’t have a lot to say, but I did find this thread where a guy had an issue, but there was no resolution.
I then spoke with an engineer at American Internet Services, who suggested placing a layer 2 switch between the HSRP connections and the Sonicwalls. He said:
I firmly believe that the issue with the HSRP is that there is no level 2 connectivity between the links as you have them plugged into two separate Sonicwall systems.
HSRP works by having the two routers communicate with each other and ensuring that one router is "Active" and the other is "Standby". If they are unable to communicate with each other, they will both become "Active" and they will both announce themselves as the default gateway which can cause collisions and packet loss.
If you have a little 4-Port switch, you could plug the two links from our routers and the two links to your Sonicwalls into it and, I believe, that would solve the connectivity issue.
It made sense to me, and it was also backed up by this KB article on how to configure high availability on a Sonicwall in the first place. Here is a diagram from the KB article that shows that you need a switch on the WAN side to make HA work correctly.
I didn’t want to rely on a shitty little 4-port switch to hold up my entire internet connection though, so I decided to take four ports across my two core switches, put them in a new VLAN, and make them access ports. I then ran one HSRP link into a port on one switch and the other HSRP link into a port on the other switch (you know, so that if one switch dies, the Internet stays up!). I then plugged my firewalls into the other two ports. Now both HSRP connections are working correctly, and we have redundant internet again. Plus our transparent DMZ works fine with this configuration.
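If it helps to see why the shared VLAN made the difference, here is a toy Python model of the election the engineer described. It is only a sketch under some big assumptions: the router names and priorities are made up, real HSRP uses multicast hellos and timers that are not modeled here, and ties are broken on the name instead of the IP address. All it captures is the rule that a router backs off to Standby only when it can hear a better peer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str       # hypothetical label, not a real provider router name
    priority: int   # HSRP priority; higher wins the election


def active_routers(routers, segments):
    """Return the names of routers that would consider themselves Active.

    segments is a list of sets of router names; routers in the same set
    share a layer 2 segment and can hear each other's hellos.
    """
    active = []
    for router in routers:
        # Peers whose hellos this router can actually receive.
        peers = [
            other for other in routers
            if other is not router
            and any(router.name in seg and other.name in seg for seg in segments)
        ]
        # A router stays Standby only if it hears a better peer.
        # Real HSRP breaks priority ties on IP address; the name is
        # used here as a stand-in.
        if all((router.priority, router.name) > (p.priority, p.name) for p in peers):
            active.append(router.name)
    return active


routers = [Router("rtr-a", 110), Router("rtr-b", 100)]

# Each uplink plugged straight into its own Sonicwall: no shared segment.
direct = active_routers(routers, [{"rtr-a"}, {"rtr-b"}])

# Both uplinks landed in one VLAN on the core switches.
shared_vlan = active_routers(routers, [{"rtr-a", "rtr-b"}])

print(direct)       # both routers claim Active
print(shared_vlan)  # only the higher-priority router is Active
```

With the links isolated, neither router hears the other, so both end up Active, which matches the flaky behavior we saw. Put them on the same segment and only one wins.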
It dawned on me that since our previous provider was sub-renting our cabinet, the biscuit in the old cabinet must have been connected to a core switch or something on the back end that handled all of their customers’ internet connections. That’s why we were able to plug the Sonicwalls directly in without having to put a switch in front. Just a hunch really. They won’t tell me how theirs was set up because they are pissed at me. I guess I understand that since I did fire them for being terrible.
Although it was rough figuring this one out, it, like any hard issue I run into, forced me to learn more about how all of this works. In the end I dig this stuff because it makes me a better admin, and I can take it with me further in my career.
Do you have a similar setup to this at your colocation facility? Did you solve it differently? If so, let me know what you did in the comments!