The servers drive me nuts. Always complaining about something. So I set out late last year to make my life easier. A few minutes ago I launched the project that I call ClusterF. dNeero.com is now running on it and, knock on wood, all seems to be ok.
Things get hairy when your site outgrows its first web server. To scale you start to run it on two servers. This is called clustering. It sounds simple, but it isn't. When a website visitor goes to one of the servers any data they change needs to be visible to website visitors on the other server. So there's a system that synchronizes data between the machines.
Then two servers become three, then four, and so on. Before you know it things are complex and tedious. And, believe me, I design for dead simplicity. I'd rather write 10,000 lines of code than do something manually each time I want to deploy a new build, for example. But complexity creeps in.
My solution was to build my own little Tomcat provisioning system. My design goal was to be able to add a new server to the cluster in two minutes or less. Put a single file onto the server, double click to install, provision the instance and you're up and running. In tests I was able to do it in two minutes, but in production it makes sense to check everything you do, so it's realistically more like five minutes to set up a server. Not bad considering that it used to be around an hour... and at that I considered my processes fairly lightweight.
Here's how it works. I install ClusterF onto each server in the cluster. ClusterF includes clean copies of Tomcat and a Java JDK. Using a GUI (set of screens with buttons... a desktop application) I can control any of the servers in the cluster. I can, for example, create a new Tomcat instance.
To create a new Tomcat instance ClusterF copies the clean Tomcat into a deploy directory. It then uses a Java wrapper to make it executable as a service on Windows or Linux... this is done by copying four or five files into the instance's Tomcat directory and editing some config files... all automatically. Next, ClusterF configures Tomcat's server.xml file so that it'll cluster sessions with other instances... or create itself as a separate cluster if so chosen in the GUI.
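For the curious, here's roughly what the first step (copying the clean Tomcat into a deploy directory) could look like in Java. This is a minimal sketch and not ClusterF's actual code: the class name InstanceCopier and the method copyTemplate are made up for illustration, and it leaves out the service wrapper and the server.xml editing entirely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class InstanceCopier {
    private InstanceCopier() {}

    /** Recursively copies the clean Tomcat directory into a new instance directory. */
    static void copyTemplate(Path cleanTomcat, Path instanceDir) throws IOException {
        try (Stream<Path> paths = Files.walk(cleanTomcat)) {
            paths.forEach(source -> {
                // Map each source path to the same relative location under the instance dir.
                Path target = instanceDir.resolve(cleanTomcat.relativize(source));
                try {
                    if (Files.isDirectory(source)) {
                        Files.createDirectories(target);
                    } else {
                        Files.createDirectories(target.getParent());
                        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                    }
                } catch (IOException e) {
                    // forEach cannot throw checked exceptions, so wrap and rethrow.
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

Something like InstanceCopier.copyTemplate(cleanTomcatDir, deployDir.resolve("instance1")) would give each instance its own untouched copy, so the clean template never gets modified.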
Once the instances are up and running I can define an App. An App has a set of config properties like database connection strings, etc. I can configure these properties centrally and then ClusterF will make sure they're available as a properties bundle each time an instance is run. I can run multiple Apps per instance by choosing different root directories in the GUI.
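A properties bundle here can be as plain as a java.util.Properties file written out before the instance starts. The sketch below assumes that shape; AppConfigBundle and the db.url key in the usage note are hypothetical names, not what ClusterF really uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class AppConfigBundle {
    private AppConfigBundle() {}

    /** Writes the centrally managed App config to a .properties file for one instance. */
    static void write(Map<String, String> centralConfig, Path file) throws IOException {
        Properties props = new Properties();
        props.putAll(centralConfig);
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "Generated from central App config");
        }
    }

    /** Reads the bundle back, as the App would at startup. */
    static Properties read(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }
}
```

The App side then only needs AppConfigBundle.read(file).getProperty("db.url") and never has to know where the value was edited.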
Using the GUI I associate Apps with instances, saying essentially that I want App X to run on Servers 1, 2 and 5. I can then deploy a WAR file. ClusterF allows me to choose the WAR file centrally and then it distributes it to the cluster. This was actually quite tricky. I'm clustering using JGroups. I had to break the file into little chunks and send each one as a separate part. JGroups doesn't (yet) handle big payloads well. On the other side I re-assemble the file and process it. Of course, I have to have error checking (MD5 checksum), receipt management, versioning, etc. to make sure that the WAR deploys properly.
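The chunk-and-verify idea is easy to show without any of the JGroups plumbing. The sketch below only covers splitting a byte array, re-assembling it and checking the MD5; the transport, receipts and versioning are left out, and WarChunker and its method names are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class WarChunker {
    private WarChunker() {}

    /** Splits the file contents into chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(data.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(data, offset, end));
        }
        return chunks;
    }

    /** Joins the chunks in order and rejects the result if the MD5 does not match. */
    static byte[] reassemble(List<byte[]> chunks, String expectedMd5Hex) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        byte[] data = out.toByteArray();
        if (!md5Hex(data).equals(expectedMd5Hex)) {
            throw new IllegalStateException("MD5 mismatch, refusing to deploy");
        }
        return data;
    }

    /** Returns the MD5 digest of the data as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The sender would compute md5Hex once on the whole WAR and ship it alongside the chunks, so the receiver can refuse a file that arrived incomplete or corrupted.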
ClusterF allows me to start/stop all instances running an app with a single button click. And it's cluster smart... meaning that it'll start one instance and give it a 60 second head start so that it'll establish itself as the primary controller in the cluster... then it'll start other instances.
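The head start logic boils down to: start one, wait, start the rest. Here's a small sketch of that ordering. StaggeredStarter is a made-up name, and the starter callback stands in for whatever actually launches a Tomcat instance, which isn't shown.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class StaggeredStarter {
    private StaggeredStarter() {}

    /**
     * Starts the first instance, waits for the head start, then starts the others.
     * The starter callback is a placeholder for the real "start this instance" action.
     */
    static void startAll(List<String> instances, Consumer<String> starter, Duration headStart)
            throws InterruptedException {
        if (instances.isEmpty()) {
            return;
        }
        starter.accept(instances.get(0));
        if (instances.size() > 1) {
            // Give the first instance time to establish itself before the others join.
            Thread.sleep(headStart.toMillis());
            for (String instance : instances.subList(1, instances.size())) {
                starter.accept(instance);
            }
        }
    }
}
```

With the 60 second head start described above, the call would pass Duration.ofSeconds(60) as the last argument.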
So far I've saved myself time setting up Tomcat instances, managing config files and deploying builds. The last thing I wanted to save time with was restarting app servers.
I built a monitoring system into ClusterF. Each instance pings the others in the cluster periodically to make sure they're up. If one isn't, ClusterF can restart it automatically, after a set period of time, etc. I'm essentially trying to make the system self-healing when things like out-of-memory errors happen.
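One simple way to decide when a restart is due is to count consecutive missed pings per peer. That rule is an assumption on my part for the sake of the example, as are the PeerWatchdog name and the threshold; the real trigger could just as well be time based.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class PeerWatchdog {
    private final Map<String, Integer> missedPings = new HashMap<>();
    private final int maxMissed;

    /** maxMissed is the number of consecutive failed pings that triggers a restart. */
    PeerWatchdog(int maxMissed) {
        if (maxMissed <= 0) {
            throw new IllegalArgumentException("maxMissed must be positive");
        }
        this.maxMissed = maxMissed;
    }

    /** Records one ping result and returns true when the peer should be restarted. */
    boolean record(String peer, boolean alive) {
        if (alive) {
            // Any successful ping clears the failure streak.
            missedPings.remove(peer);
            return false;
        }
        int count = missedPings.merge(peer, 1, Integer::sum);
        if (count >= maxMissed) {
            // Reset so the peer gets a fresh streak after the restart.
            missedPings.remove(peer);
            return true;
        }
        return false;
    }
}
```

The periodic ping loop would call record(peer, alive) after each round and kick off the restart (and the notification) whenever it returns true.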
When things go down I get cell phone pings and emails. I want to be informed, of course, but my hope is that I'll be able to watch ClusterF work through the issue itself from afar. Every now and then I'm sure I'll have to intervene.
ClusterF is a remote control for Tomcat instances. I didn't want Tomcat instances to ever run inside of ClusterF's runtime. Doing so would have introduced another level of possible failure on top of Tomcat. I didn't want that. The Tomcats and ClusterF run independently of each other. If ClusterF goes down I lose some monitoring and self-recovery capability but the Apps stay up. They're decoupled.
Work on ClusterF was always secondary. I got a few hours here and a few hours there for a couple of months in Dec 07/Jan 08. Then I got pulled away by other things. A couple of weeks ago, with the servers being a pain in the ass, I decided to revive ClusterF. Tonight I finally got it up and running.
Of course, we all know that when I release something it's virtually guaranteed that I'll spend the next 24-48 hours cussing up a storm as it hits the fan. We'll see how this one goes.
Plans for the future of ClusterF? I'll add stuff as I need it. More robust page checking would be nice... the ability to specify a URL and define a string that must come back or else it's defined as a fail. The GUI sucks, but works... it could use some cleaning up. I'd also like to build a web interface. To do so I'll use NanoHTTPD and write simple pages that have basic status reporting and controls (start, stop, disable, etc.).
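The page check is small enough to sketch now. This is only a guess at how it might look: PageCheck, the five second timeouts and the "must be HTTP 200" rule are all assumptions, and it uses the JDK's HttpURLConnection rather than anything ClusterF-specific.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of ClusterF).
final class PageCheck {
    private PageCheck() {}

    /** The pass/fail rule: HTTP 200 and the required string present in the body. */
    static boolean passes(int status, String body, String required) {
        return status == 200 && body != null && body.contains(required);
    }

    /** Fetches the URL and applies the rule; any error or bad URL counts as a fail. */
    static boolean check(String url, String required) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5000); // assumed timeouts, tune as needed
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            if (status != 200) {
                return false;
            }
            try (InputStream in = conn.getInputStream()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                return passes(status, body, required);
            }
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            return false;
        }
    }
}
```

A failed check could then feed the same restart and notification path as a missed ping.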
The name ClusterF? It's software that makes clustering possible. I can't remember what the F stands for... oh, wait... yes I can. At startup: "Now we're F'd."