Originally published in the June 1997 issue of ;login: magazine.
This article describes a scalable news architecture built around a single news spool. The architecture is in operation at EarthLink, where it supports over 1700 concurrent newsreading connections, and it will scale, without alteration of the general principles, to well over 10000 concurrent client connections. This article enumerates the advantages of such a system, the problems we have overcome, and the hurdles we are likely to face in the near future.
USENET News has become one of the most difficult of the core Internet service protocols to manage effectively, especially for very large numbers of users. Current best practice has been to build a carefully tuned machine running INN 1.4 or better, store the large article spool on locally attached storage media, and allow NNRP connections from some number of customers until the load becomes too great for the machine to provide adequate service. At that point, one is forced either to upgrade the hardware or to add another server of similar configuration, splitting the customer load statically across distinct servers or dynamically via master/slave replication.
This article explores the problems inherent in the standard news system, describes how our service works, and explains why we think this is the right way to scale USENET News service to very large numbers of users.
It's fairly straightforward to support USENET news service for 100-200 concurrent client processes. One can do this relatively inexpensively on carefully chosen hardware: a high-end Pentium-based UNIX box or other decent UNIX workstation. One maintains a single INND (or comparable NNTP server process) on a machine that has sufficient resources to run one nnrpd process for each concurrent client connection. With today's USENET news volume, which we currently measure as averaging in excess of 2.5 GB/day, one also has to allocate spool disk space, which may be distributed or striped across multiple disks or a RAID system. A service outage occurs if any of the disk partitions used by news fills or the server encounters an error from which it cannot gracefully recover. In times of high load, the nnrpd processes can steal performance from INND; thus the platform should be designed with sufficient capacity to catch up with its feeds during off-peak hours. All in all, this is a fairly straightforward procedure and can support up to a few hundred concurrent client connections given a sufficient application of hardware.
One of the problems with this is that this server is a single point of failure for news service. Some of the realities of news service are that operating systems crash, hardware fails, or INN stops running properly. In this model, any of these scenarios renders news service unavailable. Although this is probably acceptable in a corporate environment, where USENET news service would not be considered mission critical, ISP environments that deliver Internet services as the core business find critical reliance on one machine like this to be undesirable without at least some additional safeguards.
Another problem one can face in maintaining news service is with the design and performance of most standard filesystems. When USENET first started out, or even when INN was first released, no one imagined that 9000+ articles/day would be arriving in a single newsgroup. Fortunately or unfortunately, this is the situation we now face. Most filesystems in use today, such as BSD's FFS (McKusick, et al., 1984) still use a linear lookup of files in directory structures. With potentially tens of thousands of files in a single directory to sort through, doubling the size of the spool (for example, by doubling the amount of time articles are retained before they are expired) will place considerably more than double the load on the server. There are filesystems available that use a hashing directory lookup, like SGI's XFS (Sweeney, et al., 1996) and the Veritas (Veritas Software, 1995) file system, but these are not available for all platforms even though they produce considerable advantage when used on news servers.
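To make the directory-scan cost concrete, the following sketch (purely illustrative, not drawn from any filesystem's source) counts the name comparisons a linear directory scan performs versus a hashed lookup, and shows why doubling a busy group's directory more than doubles the work:

```python
# Count the name comparisons needed to open every file in a directory once,
# under a linear scan (as in FFS directories) versus a hashed lookup
# (as in XFS or VxFS). Illustrative only; not filesystem code.
def linear_scan_cost(n):
    # finding the i-th entry costs i comparisons; the total is n(n+1)/2
    return sum(i for i in range(1, n + 1))

def hashed_cost(n):
    # roughly one probe per lookup with a well-behaved hash
    return n

small, large = 5000, 10000   # e.g. doubling a busy group's directory
ratio = linear_scan_cost(large) / linear_scan_cost(small)
print(ratio)                 # about 4: double the files, roughly 4x the work
```

The hashed cost merely doubles, which is why hashed-directory filesystems hold up so much better as retention times grow.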
Although contemporary file systems try to do as much work as possible asynchronously through methods like write buffering and read-ahead, most still perform some synchronous operations like inode creation/deletion. This doesn't create a significant overhead in writing to large files, but if one is creating and deleting many small files, as one does on a news server, a performance penalty will be paid while the inodes in which the articles are to reside are created or deleted synchronously before the next operation can be executed by that process or process thread. Some file systems, like Linux's ext2fs (Card, Ts'o & Tweedie, 1994), perform their inode manipulation asynchronously. This results in a considerable performance benefit, but at the price of filesystem stability in the event of a machine crash. Again, this may be an acceptable cost/benefit trade-off in an environment where news is not mission critical.
One can safely mitigate this somewhat by using RAID controllers with an NVRAM write cache or a Legato Prestoserve (Legato Systems) card, which accepts and immediately acknowledges the inode request and saves it into battery-backed storage that will survive a machine crash or power outage.
A number of other programming efforts are under way to address filesystem limitations within INN itself. One of several proposals is to allocate articles as offset pointers within a single large file. Essentially, this replaces the relatively expensive directory operations with relatively inexpensive file seeks. One can combine this with a cyclic article allocation policy. Once the file containing articles is full, the article allocation policy would "wrap around" and start allocating space from the beginning of the storage file, keeping the spool a constant size. This eliminates the need for expire to run and prevents the disk from filling up (Fritchie). Although these new techniques are very promising, they are still experimental and not appropriate for use in a mission critical environment.
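A minimal sketch of the cyclic-allocation idea (our own toy illustration in Python, not code from INN or from any of the proposals mentioned) might look like this:

```python
# Toy cyclic spool: articles live as (offset, length) records in one large
# file, and the write pointer wraps around when the file is full, silently
# reclaiming the oldest articles. Illustrative only.
class CyclicSpool:
    def __init__(self, path, size):
        self.size = size
        self.f = open(path, "w+b")
        self.f.truncate(size)
        self.head = 0        # next write offset
        self.index = {}      # article-id -> (offset, length)

    def store(self, art_id, data):
        if len(data) > self.size:
            raise ValueError("article larger than spool")
        if self.head + len(data) > self.size:
            self.head = 0    # wrap around to the start of the file
        start, end = self.head, self.head + len(data)
        # drop index entries for articles we are about to overwrite;
        # this implicit reclamation is what replaces expire
        self.index = {k: v for k, v in self.index.items()
                      if v[0] + v[1] <= start or v[0] >= end}
        self.f.seek(start)
        self.f.write(data)
        self.index[art_id] = (start, len(data))
        self.head = end

    def fetch(self, art_id):
        off, length = self.index[art_id]
        self.f.seek(off)
        return self.f.read(length)
```

Note that each store and fetch is a seek plus a read or write; no directory entries are ever created or removed, and the spool can never grow beyond its fixed size.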
A big problem with the classic news architecture is what happens when the service runs out of capacity without a clear upgrade path. This can happen either by a growth in the number of people interested in reading news or, inevitably it seems, by a growth in USENET news volume. Historically, USENET news volume has about doubled every 16 months, punctuated by occasional bursts of growth (Swartz, 1993). One could easily build a machine today that could adequately handle service requirements only to find that 6 months or a year later it is wholly inadequate to the task.
One can regain performance, as well as disk space, by reducing expire times, but that is at best a stopgap: one must cut expire times roughly in half about every year just to keep the spool size constant, and reducing them further for performance reasons is likely to be viewed as undesirable. One can also regain performance by dropping part of the feed, but this, too, is not an elegant fix.
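Given the growth figures above, the arithmetic behind shrinking expire times is easy to sketch (the 2.5 GB/day figure and the 16-month doubling period are from above; the 14-day starting retention is a hypothetical value, not our actual expire time):

```python
# How retention must shrink to hold the spool at a constant size, assuming
# 2.5 GB/day of traffic today and volume doubling every 16 months.
# The 14-day starting retention is a hypothetical example value.
DOUBLING_MONTHS = 16
daily_gb = 2.5
retention_days = 14
spool_gb = daily_gb * retention_days   # fixed spool budget: 35 GB

for months in (0, 16, 32, 48):
    volume = daily_gb * 2 ** (months / DOUBLING_MONTHS)
    # the retention that keeps the spool at its original size
    retention = spool_gb / volume
    print(f"month {months:2d}: {volume:5.1f} GB/day -> {retention:4.1f} days retention")
```

After four years of such growth, the same disks hold under two days of articles, which is why expire-time cuts alone cannot keep a fixed configuration viable.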
There comes a point where hardware upgrades are no longer practical or cost-effective; beyond it, one must either do a forklift upgrade or split the service across multiple machines. Splitting the service creates a number of problems. First, in our experience, most of the energy required to maintain news service is devoted to interacting with INND. This is not a knock against INND; it is by necessity a complicated piece of software performing complicated tasks. Nevertheless, after a while, it doesn't make good business sense to have support costs (in personnel) scale linearly with the number of customers using a service.
News clients keep state information for the news server that they talk to. If the servers are not carefully synchronized with regard to newsgroups, article numbering, and article ordering, clients are unable to switch between them without losing state and causing consternation on the part of the people using the service. If the machines are not synchronized, one must maintain a customer balance between them. In a corporate environment, this isn't too difficult; but in an ISP environment, when from month to month many new customers join and many old customers leave, keeping the load balanced is very difficult. If one can adequately replicate the servers' state, then one can do decent load balancing using Round Robin DNS.
One of the key factors in increasing service performance and decreasing the amount of time necessary to maintain a given service level is keeping servers that perform identical tasks identical in both their hardware and software states. If one is balancing load over six news servers, the service will be much easier to maintain if they are identical in every aspect. Because INND and news in general require some periodic modification, it is easy for the state on each server to "drift." With sufficient discipline this can be minimized; but given that the typical experienced system administrator is likely a fairly busy person, it is almost impossible to avoid altogether, because attempts to combat state drift are never on management's to-do list. This is in general a volatile situation, and architectures that centralize maintenance and modification functions are to be preferred over those that do not.
At EarthLink, we require a highly robust service environment resistant to single points of failure that, due to the phenomenal growth of our company and the Internet access business in general, must be straightforward to scale to an arbitrary size. We have to keep our services functional 24 hours a day, seven days a week. Further, system performance must be adequate, and we wish to minimize system maintenance requirements.
What we have done is build our news service, among other services, as a large "virtual server," using high-performance network-accessible data storage devices as our "virtual disk" storage system, multiple UNIX-based workstations functioning as a very loosely coupled parallel processing engine, and a high-speed FDDI network as the "backplane" of this virtual server. Because of its ubiquity, and because several vendors have highly reliable, extremely well tuned hardware solutions, we use NFS as our data transfer protocol.
The heart of the service is a set of robust, high-speed, dedicated network fileservers exporting the USENET news spool and the configuration area, where the active, history, and configuration files live, to a number of dataless UNIX servers. These filesystems are exported to each computer that is a functional part of our news service. The devices we have chosen for this are the Network Appliance (Hitz, 1995) family of fileservers, although one could substitute other NFS fileservers such as the Auspex (Auspex Systems) or EMC Symmetrix/DART (EMC Corp.) fileservers.
The Network Appliance fileservers employ all of the favorable file system characteristics mentioned previously. The WAFL (Hitz, Lau, & Malcolm, 1994) filesystem is highly tuned for NFS performance, uses hashed directory structures, provides NVRAM for write caching, and provides considerable standard RAM for read caching. It also employs and is tuned for RAID 4, which enables us to lose a single disk anywhere in the common news area without losing the filesystem and, therefore, the service. Because the whole filesystem is tuned for RAID operation, we do not see the same write penalties that plague many other RAID systems. Another advantage of using something like the Network Appliance is that it is special purpose hardware designed and built solely to deliver high-speed file access via NFS. As such, simplicity of operation was an important design parameter, making it robust and easy to maintain.
Only one machine in our news service, the news feed server, mounts the news data as a writable filesystem. This machine is the sole point of data entry into the filesystem, and it is the only machine that runs INND. All the feeds come into and go out of it, and it receives the logging information from all the other servers. We have only one place to update access permissions, expire times, group additions and subtractions, etc. We have only one machine on which to monitor the INND process and no copies of our news state to keep synchronized.
We also have multiple nnrpd servers, which mount the configuration area and the spool read-only. In addition to preventing these machines from scrambling the key news files, this aids performance, because certain types of NFS attributes no longer need to be communicated between NFS server and client. The nnrpd process is started out of /etc/inetd.conf and set to forward all postings on to our feeder machine. Each server also has a storage area (again, NFS mounted) where it can store these postings if, for some reason, the feeder machine is throttled or otherwise unavailable. If this happens, it periodically tries to resend these postings until they get through. We have instructed syslog.conf to forward all log messages back to the feeder machine, where they may be processed en masse to gain statistics and information about the general health and utilization of the service.
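By way of illustration, the relevant configuration on a reader machine might look something like the following (the paths, program locations, and host name are assumptions for the example, not a transcript of our actual files):

```
# /etc/inetd.conf: spawn one read-only nnrpd per incoming NNTP connection
nntp  stream  tcp  nowait  news  /usr/local/news/bin/nnrpd  nnrpd

# /etc/syslog.conf: forward news log traffic to the feeder for central analysis
news.*          @newsfeed.example.com
```

Running nnrpd from inetd, rather than from a local INND, is what lets these machines remain stateless clones of one another.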
All these NNRP servers are essentially clones of each other and contain no volatile data. This means that, first, they require very little maintenance and thus there is little opportunity for state drift to occur. Second, because they are clones of each other, it takes very little effort to add a new one into service if there are service capacity issues. Third, we use Round Robin DNS to select which machine a customer connects to, and we make certain to have more machines in service than are required at peak times. So even if one crashes or its hardware fails, we can remove it from the Round Robin and inspect/repair/rebuild it at our leisure with no loss of functionality beyond that of the original crash.
This architecture also allows us to scale relatively easily. By using a switched network carefully designed to not be oversubscribable, we eliminate network loading as a scaling criterion. If NNRP server horsepower is lacking, we can add another cloned server to handle the increased load in very little time. If we run out of capacity or performance on the file storage, we can, with minimal downtime, split it across multiple fileservers. All we have to do is turn off the writing by the feeder and copy the portion of the spool and/or configuration files to be split while client postings gracefully accumulate. Then with a single reboot, each machine sees the newly split spool with increased performance. If it becomes necessary, the Overview spool can be split up in the same manner because it has a directory structure that mimics the article spool.
In this architecture, we gain performance in a cost-effective manner by adding more spindles (extra disk drives over which we can balance the I/O loading) to the fileservers. As a consequence, we have a spool space that is much larger than we need for storage, even with long expire times. In fact, our expire times are currently limited by client ability to download the Overview database and articles over a dialup phone line, rather than our storage or performance capacity.
Still another advantage is that the only ways client load can influence the feeder machine are by loading the fileservers and sending large numbers of postings to the feeder. This helps to isolate our feed performance from the client demand.
Although we think this is a superior architecture for large numbers of client news readers, it is not without its limitations. There are several hurdles we've had to overcome.
First, each file I/O operation must pay the cost of going through the NFS and network code in addition to the filesystem portions of the operating system. Although we see much better write performance, and comparable read performance, over a high-speed network to our fileservers than we would to busy local storage, we incur more CPU overhead. Further, NFS in general is not a protocol well suited to high-performance network transactions of the sort we describe here. This is mitigated, however, by the considerable tuning done by the dedicated NFS fileservers.
The news feeder machine is the worst for this because it is not a distributable process; only one INND can update the active and history files at any one time. Consequently, it must be a fairly powerful machine if it is to keep up with USENET traffic while dealing with the NFS write overhead with every article it receives.
Another problem is that INND is not well tuned for operation over NFS. We've tuned out some of the major bottlenecks in the code, but a great deal of work in this area remains to be done. This is most evident in accessing the active and history files. The history file is constantly growing, and the active file is continually being modified at more or less random locations. Many of the standard INN tuning tricks intended to increase performance actually decrease it if the active and history files are NFS mounted.
Some operating systems still use archaic network message buffering systems such that once an mbuf is allocated, it is never freed. A brief flood of simultaneous requests to a given server can cause it to allocate nearly all its memory to mbufs, which are then never freed, making the machine unusable and forcing a machine reboot. It's important to avoid such operating systems wherever possible in this architecture.
Also, our customer base is growing much faster than Moore's Law (and kindred laws) predicts computer component performance will increase, so we continually end up with more and more nnrpd servers and fileservers. With the nnrpd servers, we can and have upgraded to larger classes of machines to make maintenance easier by having fewer servers to deal with (while still maintaining N+1 redundancy); with the fileservers, however, the spool splits come more and more frequently as we grow.
With our split spool, we actually violate one of our design parameters by replacing a single point of failure with multiple single points of failure. Given the robustness of the fileserver and the reduced maintenance cost by running only one INND, we believe this risk is justified.
Finally, one of our biggest concerns is that our fileservers are getting so powerful and access to the news spool is so scattered and random that we will soon be running into a performance limitation dictated solely by the I/O capability of each disk drive. We don't see breakthroughs in magnetic disk technology that are likely to be of much assistance to us here, and to limit the eventual deployment of a very large number of file servers, we are hoping for price and deployment breakthroughs of solid state disk technology.
Within the news storage system that INN uses, a record of the existence of each article is kept in no fewer than four places: the active file, the history file, the Overview database, and, of course, the article spool. One of the toughest tasks of a news administrator is keeping all four records of each news article synchronized. We believe not only that this has gotten too complex, but that the way data is stored in each module has reached the end of its useful life. That is to say, at the very least, data stored in flat files really ought to be stored in hashed files, data stored in hashed files ought to be stored in true databases, and data stored in directory trees ought to be stored more efficiently.
The time has come to rethink how news is stored. First, philosophically, the USENET Store and Forward model has reached the end of its useful life. A more WWW-like Distribute Reference and Cache model would work much better: a scheme whereby only headers for the articles are transmitted through the network and the body is fetched from the posting site on request and then cached locally in a large, efficiently organized data store before being expired.
This philosophical point notwithstanding, maintenance of a news service can be made easier, and the lifetime of Store and Forward extended, by merging the history and active files (and perhaps the Overview database as well) into one true SQL-accessible database. This is something the ISC should seriously consider in its development of INN. Even if the ISC doesn't add it, we'd like to see an insertion point placed in the code and the INN API extended to cover alternate sources of these repositories, so that someone else could extend INN in this manner while maintaining compatibility with an evolving INN code base. In addition, the news spool itself needs to be restructured; we consider the directions that Fritchie and others are taking to be promising.
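As a sketch of what such a merge might look like (our illustration using SQLite; the table and column names are assumptions, not an INN or ISC design), the per-group state in active and the per-article record in history become two tables updated in one transaction:

```python
# Sketch: the active and history files merged into one SQL database.
# Table and column names here are illustrative assumptions only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- the active file: one row per group, with high/low water marks
    CREATE TABLE active (
        newsgroup TEXT PRIMARY KEY,
        himark    INTEGER NOT NULL,
        lomark    INTEGER NOT NULL,
        flag      TEXT NOT NULL DEFAULT 'y'
    );
    -- the history file: one row per article seen, keyed by message-id
    CREATE TABLE history (
        msgid     TEXT PRIMARY KEY,
        arrived   INTEGER NOT NULL,
        expires   INTEGER,
        spoolpath TEXT              -- NULL once the article has expired
    );
""")

def accept_article(msgid, group, arrived, path):
    # one transaction replaces separate writes to active and history;
    # both updates commit together or roll back together
    with db:
        db.execute("UPDATE active SET himark = himark + 1 WHERE newsgroup = ?",
                   (group,))
        artnum = db.execute("SELECT himark FROM active WHERE newsgroup = ?",
                            (group,)).fetchone()[0]
        db.execute("INSERT INTO history VALUES (?, ?, NULL, ?)",
                   (msgid, arrived, path))
        return artnum

db.execute("INSERT INTO active VALUES ('comp.arch', 0, 1, 'y')")
n = accept_article("<1@example>", "comp.arch", 867000000, "comp/arch/1")
```

A duplicate message-id is rejected by the primary key, and a crash mid-update can no longer leave active and history disagreeing, which is precisely the synchronization burden described above.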
Although it is relatively straightforward to provide USENET news access to up to a few hundred concurrent client processes, we believe we have demonstrated a highly robust, minimal maintenance, high-performance news system based around a single INND and single news spool.
The locally spooling nnrpd code was written by Dave Hayes, email@example.com, and is available on request. We'd like to thank the EarthLink Network Engineering staff for all their efforts in installing and maintaining this service, and we'd especially like to thank the EarthLink system administration team: Gianni Carlo Contardo, Jason Savoy, Marty Greer, Alwin Sun, Jewel Clark, Mike Montgomery, Tom Hsieh, Lori Barfield, and David Van Sandt. Further, thanks go to Tim Bosserman, Dave Hayes, Scott Fritchie, and Karl Swartz for their constructive comments on the original draft of this article.
Auspex Systems, Inc. http://www.auspex.com/Product_Info/TechReports.html
Card, R., Ts'o, T., and Tweedie, S., "Design and Implementation of the Second Extended Filesystem," Proceedings of the First Dutch International Symposium on Linux, December 1994, http://www.redhat.com:8080/HyperNews/get/fs/ext2intro.html.
EMC Corp., http://www.emc.com/products/hardware/network/snfs/datasheet/snfs_1.htm
Fritchie, S. Private communication. firstname.lastname@example.org
Hitz, D. 1995. "An NFS File Server Appliance," http://www.netapp.com/technology/level3/3001.html
Hitz, D., Lau, J., and Malcolm, M. 1994. "File System Design for an NFS File Server Appliance". Proceedings of the 1994 Winter USENIX, San Francisco, CA. 1994, 235-246.
Legato Systems, Inc. http://www.legato.com
McKusick, M., Joy, W., Leffler, S., and Fabry, R., "A Fast File System for UNIX," ACM Transactions on Computer Systems vol. 2, no.3, August, 1984, 181-197.
Swartz, K., "Forecasting Disk Resource Requirements for a Usenet Server," Proceedings of the USENIX Systems Administration (LISA VII) Conference, Monterey, CA, 1993, 195-202.
Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G., "Scalability in the XFS File System," Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, CA, 1996, 1-14.
Veritas Software, 1995. http://www.veritas.com/News/fs-perf.html