Herein find documented the current design theory underlying CNS' multi-server TSM cluster.
The University of Florida's central TSM server is used by a variety of on- and off-campus units. It currently serves some 600 hosts, ranging from small, irregularly touched workstations up to terabyte mail systems. Until 2004 we operated as a single monolithic server; this simplified some aspects of maintenance, but we've grown large now, and need to do something different.
We serve a diverse university community. Administrative applications, academic support, departmental resources, and commercial interests all make use of our facilities. To support the administrative distinctions these customers need, we currently maintain 48 administrative domains, each with its own set of administrators.
Some of these domains denote actual inter-organization barriers, but others describe storage-use distinctions. For instance, there is a domain to support our Division of Continuing Education, but there is also one to support our archives of expired user accounts.
We have many different audiences making use of our TSM infrastructure. In addition to the usual rogues' gallery of servers and workstations, we've got several special-purpose arrangements.
One of the more important clients is our Content Manager installation. This document-management system is the center of much of our Registrar's business process. Its presence places an unusual constraint on the backup system: most TSM servers have the daytime more or less free to perform maintenance, when even momentary downtime might be acceptable. Not when you're serving as the archive repository for documents.
We're serving as the log-retention archive for many central systems. That means that if someone wants to (say) analyze web logs, it might generate a large flurry of mount requests as the last six months' data is reeled off of tape.
Our server is currently storing about 78TB of data. Offhand, I expect this to double in each of the next two years.
Our daily traffic tends to wander between 400 and 600GB a night, with spikes over a terabyte.
We clearly need some variety of offsites, but there's been very little support for implementing them until very recently. We anticipate doing this by taking physical tapes offsite; we've experimented with electronic transport of some data, but for most of our applications the transport can't keep up with the nightly load of data.
The database for this server, about 98GB, is currently housed on 24 9GB SSA drives. This database is dramatically ungainly to maintain; it's possible that we could reduce its size by as much as 20% through an unload/reload, but the last time we did that, when the DB was less than 60GB, it took 40+ hours to complete. This is completely unacceptable performance for a service on which day-to-day operations depend.
Heading into the fall of 2003, we were positioned to upgrade the infrastructure on which the TSM server was based. We were fairly confident we could purchase a single new server; it seemed unlikely that sufficient resources would be available to put our important workloads on separate hardware.
This led us to contemplate a 'virtual server' split [...?...]
With the decision in place to generate more than one server instance on the same box, the next issue was to determine how many servers were necessary, and what functions they should each embody.
The first design pass was a simple cluster of functionally independent servers: each server connected to our 3494 with its own set of category numbers for tape allocation. Each server was to be capable, in principle, of being pried out of the library and re-installed on completely new hardware. The major disadvantage of this design is the complete fragmentation of drives and scratch pools. Each server's scratch would be completely separate, and moving tapes from server to server would require actually checking volumes in and out of the library. Furthermore, drives would have to be passed around from server to server. This would have been bad enough with a single set of drives, but with both 3590s and 3592s in play, it was simply unworkable. This idea was quickly pitched.
So it was clear that the drives and tapes had to be managed by a single server, and that the other server instances should view the tapes as remotely managed. I then turned my sights on the count of server instances which would be exposed to clients. We have several applications which use TSM as a utility, a large number of administrative domains, and several groups which use our TSM server as a sink for their remote volumes. The problem seemed simple: one instance each.
At this point it became clear that the number was foolishly large. The reasons may already be clear, but I'll enumerate some of the more interesting ones:
The number of different instances would make near-perfect automation of cross-server monitoring and administration a critical precondition for production service. Day one would require perfection. Not likely.
Each server requires its own cache and buffers: multiply these by 40 instances and the totals add up QUICKLY.
Each server requires distinct storage resources: DB volumes, log volumes, storage volumes, etc. This means many, many small volumes to keep track of, and to get wrong.
Each server needs a separate volume for a DB backup. Every day. Without fail. 40 volumes times seven days' retention equals a LOT of wasted space; my 90GB database fits onto a single volume now; split that 40 ways?
The complexities generated by simplistic splits having sunk in, I set about dividing my client base into divisions which logically hung together. They fell into two broad categories: instances that client machines actually log into ("client-contact" instances) and those which serve the other instances ("administrative" instances).
In this design, a server to which clients back up data (here 'EXT') makes use of several other servers for its resources. It requests tape mounts from a library manager, or 'control' server (here 'CTRL'), and it makes its storage-pool copies to a copy server (here 'COPIES'), which in turn refers to the library manager for its own tapes.
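As a rough sketch, these relationships can be wired together with TSM's server-to-server facilities along these lines. The instance names (EXT, CTRL, COPIES) come from the text above; the library name, device path, addresses, ports, passwords, and sizes are placeholders, and exact syntax varies by TSM level:

```
/* On CTRL, the library manager: own the 3494 and share it */
DEFINE LIBRARY LIB3494 LIBTYPE=349X DEVICE=/dev/lmcp0 SHARED=YES

/* On EXT, a library client: reach CTRL, and treat the library as remote */
DEFINE SERVER CTRL HLADDRESS=127.0.0.1 LLADDRESS=1510 SERVERPASSWORD=xxxx
DEFINE LIBRARY LIB3494 LIBTYPE=SHARED PRIMARYLIBMANAGER=CTRL

/* On EXT: send storage-pool copies to COPIES as virtual volumes */
DEFINE SERVER COPIES HLADDRESS=127.0.0.1 LLADDRESS=1520 SERVERPASSWORD=xxxx
DEFINE DEVCLASS COPYVV DEVTYPE=SERVER SERVERNAME=COPIES MAXCAPACITY=40G
DEFINE STGPOOL OFFSITE COPYVV POOLTYPE=COPY MAXSCRATCH=9999
```

The point of the layout is that EXT never owns drives or category numbers; everything physical is brokered through CTRL.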
Library manager instance (administrative). If all client-contact instances view their tapes as managed remotely, then there's no special relationship to work around. TSM DB for this instance is tiny, with thousands of entries total.
TSM DB backup instance (administrative). If we do DB backups 'remotely' to another server, then we have several optimizations available to us: we can store a DB backup on some increment of media smaller than a full volume; we can make copy pools from the primary pools that hold DB backups, and thus store identical DB backups in many places, which seems better than making many snapshots; and we can run many more DB backups simultaneously, since we don't need to allocate a physical device per backup.
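Concretely, a client-contact instance's nightly DB backup could then look something like this sketch (the DBB server name, address, port, password, and capacity are placeholders):

```
/* On each client-contact instance: a device class pointing at the DBB server */
DEFINE SERVER DBB HLADDRESS=127.0.0.1 LLADDRESS=1530 SERVERPASSWORD=xxxx
DEFINE DEVCLASS DBBACK DEVTYPE=SERVER SERVERNAME=DBB MAXCAPACITY=40G

/* The nightly full writes virtual volumes instead of claiming a physical drive */
BACKUP DB DEVCLASS=DBBACK TYPE=FULL
```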
Content Manager (client-contact). CM is a large, critical application. It's desirable to insulate this server from oddities in other servers, and vice versa.
GL Mail (client-contact). The GatorLink Mail complex is our single largest-displacement application. It uses more than 30 million files and more than a terabyte of live storage. In many ways it exists in a performance envelope all its own. We'd like to mew it up in its own problem- and failure-domain.
ERP (client-contact). We have an administratively distinct development operation accreting around the deployment of Enterprise Resource Planning applications. This is a customer base which we would like to insulate from complexities fomented by other clients.
Internal clients (client-contact). We have many machines, server and workstation, which are managed by folks who work for our data center. The administrative interactions we can have with these people are different than those with our paying customers. From week to week it's a toss-up which set of clients might threaten the others.
External clients (client-contact). We have a variety of paying customers whose behavior might change without warning; reinforcement or moderation of this behavior arrives with next month's bill. Separation seems indicated.
Copy pool clients (client-contact). We have several agreements with other TSM installations in which we serve as their offsite host. Some of the worst database behavior we've seen out of TSM comes as an outgrowth of these arrangements (undetected deadlocks leading to server halts, etc.). Insulation is earnestly to be desired.
The database backup problem created enough tension about wasted space that I began to consider means of performing TSM DB backups without using a full volume each time. The solution seemed to be to perform DB backups to one of the "remote" servers hosted on the same hardware. This would provide the following features:
DB backups become disk-to-disk copies, possibly faster and definitely parallelizable.
DB backups can be copied to remote locations with all the copypool primitives available to normal application data.
Multiple DB backups can be stored on one volume.
Storing many DB backups on the same volume might cause folks to shudder, but so far it appears that doing this will actually improve recoverability. If we perform DB backups directly to a volume, then our options for making "another" copy are either to write another full or to take a snapshot. With this scheme, we can have a given DB backup replicated to N different locations as quickly as bandwidth will allow. Given the incremental nature of stgpool backups, it's not unreasonable to think that 90% of the backup might be offsite even before the primary backup completes.
There's a separate concern about how many 'eggs' (DB fulls) one might store in one 'basket' (physical tape volume). A simple answer is that, rather than making three DB backups, you could keep three (or four? more?) copies of the same backup and occupy fewer physical volumes. This concern will become more and more relevant as we move to larger volumes. 3592 drives hold 300GB raw; reasonable expectations of database compression rates suggest that more than two terabytes of compressed database could fit on one volume. Few folks want a TSM database larger than 100GB; that's 95% wastage.
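The 95% figure checks out under an assumed compression ratio of roughly 7:1 on database pages (an assumption for illustration, not a measurement):

```python
# Back-of-the-envelope check of the wastage claim; the 7:1 compression
# ratio is an assumption, not a measurement.
raw_capacity_gb = 300                      # 3592 native capacity
effective_gb = raw_capacity_gb * 7         # plausible compressed DB capacity
db_size_gb = 100                           # an unusually large TSM database
wastage = 1 - db_size_gb / effective_gb
print(f"{effective_gb} GB effective, {wastage:.0%} wasted")
```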
Another concern is that this creates a recursive recovery problem: in a site-wide disaster, it will be necessary to restore the database-backup (DBB) server before real work on restoring the other servers can begin. And where does the DBB server back up its own DB?
The answer to this problem is straightforward once we consider that the number of "files" (virtual volumes) resident on the DBB server is rather small compared to most "normal" TSM servers. For example, if we back up that 100GB database onto 40GB virtual volumes, then we've got three or four files a day. Even if we back up 40 servers and retain DB backups for an entire year, that's still under 60,000 files. We get this many files from the initial incremental of a moderately populated server. With more reasonable numbers (say a month retention on 20 servers) the number is even smaller: 2400 files, and a DBB server database probably measured in tens of megabytes.
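Those counts are easy to verify:

```python
# Sanity-check the virtual-volume counts quoted above.
vols_per_day = 4                    # ~100 GB DB on 40 GB virtual volumes
worst = 40 * vols_per_day * 365     # 40 servers, a full year of retention
typical = 20 * vols_per_day * 30    # 20 servers, a month of retention
print(worst, typical)               # 58400 2400
```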
A DB this small is quite reasonable to back up to FILE devclasses, which could be distributed to many locations via conventional means; rsync on the devclass directory would probably be entirely adequate, and could be run hourly, since the actual change is minimal.
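As a sketch, the DBB instance's own DB backup might look like this (the class name, directory, and capacity are invented for illustration):

```
/* On the DBB instance itself: back its DB up to a FILE device class */
DEFINE DEVCLASS DBBFILE DEVTYPE=FILE DIRECTORY=/tsm/dbbackup MAXCAPACITY=2G
BACKUP DB DEVCLASS=DBBFILE TYPE=FULL
```

paired with an hourly cron entry on the host along the lines of `0 * * * * rsync -a /tsm/dbbackup/ offsitehost:/tsm/dbbackup/` (host name hypothetical).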
On further reflection, it seemed reasonable to make this same instance into the host of the library managers; this is another relatively small database load, and is similarly in the critical path of recovery of other servers, so it seems reasonable to combine the functions.
TCPADMINPORT. The first problem I encountered with this configuration is that, regardless of which port is specified as the server port, the server listens on 1580 for administrative commands. It's necessary to also set TCPADMINPORT to the same port as TCPPORT.
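For example, in dsmserv.opt for an instance answering on port 1510 (the port number is arbitrary):

```
TCPPORT      1510
TCPADMINPORT 1510
```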
Mount thrashing. Since all of the tape mounts were to virtual volumes, the default mount retention (60 minutes) created a 'thrash' of mount points when the target server writes directly to tape. For example:
Server SOURCE completes virtual volume 'A' on server TARGET, writing to tape volume T01, and closes it. Volume 'A' is retained in 'mounted' state on SOURCE for 60 minutes, which also retains volume T01.
Server SOURCE requests a scratch virtual volume 'B' from TARGET. SOURCE has authorization for more than one mount point, so TARGET has to mount tape volume T02 to write the volume. When volume 'B' is complete and closed, it too is retained in a mounted state, thus pinning T02.
This gets especially interesting when more than one SOURCE server is attempting to access the TARGET server at the same time. If N servers are working with at least N+1 mount points, then every single new virtual-volume mount will request a new physical-media mount. Of course, this plays havoc with efficiency. The solution was to set the mount retention of the virtual-volume device class to zero: this way, the old volume is "dismounted" (and thus the underlying tape drive is, too) immediately, which permits new mount requests from the same server to be granted from the same tape. In this structure we only need sufficient mount points to service the number of simultaneous sessions.
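Assuming a DEVTYPE=SERVER device class named COPYVV (a name invented here for illustration), the fix is one command on each SOURCE server:

```
UPDATE DEVCLASS COPYVV MOUNTRETENTION=0
```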
FORCESYNC. UPDATE SERVER with FORCESYNC=YES doesn't seem to work.