Computing with HTCondor
HTCondor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, HTCondor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to HTCondor, HTCondor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
While providing functionality similar to that of a more traditional batch queueing system, HTCondor’s novel architecture allows it to succeed in areas where traditional scheduling systems fail. HTCondor can be used to manage a cluster of dedicated compute nodes (such as a “Beowulf” cluster). In addition, unique mechanisms enable HTCondor to effectively harness wasted CPU power from otherwise idle desktop workstations.
For instance, HTCondor can be configured to only use desktop machines where the keyboard and mouse are idle. Should HTCondor detect that a machine is no longer available (such as a key press detected), in many circumstances HTCondor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. HTCondor does not require a shared file system across machines – if no shared file system is available, HTCondor can transfer the job’s data files on behalf of the user, or HTCondor may be able to transparently redirect all the job’s I/O requests back to the submit machine. As a result, HTCondor can be used to seamlessly combine all of an organization’s computational power into one resource.
The ClassAd mechanism in HTCondor provides an extremely flexible and expressive framework for matching resource requests (jobs) with resource offers (machines). Jobs can easily state both job requirements and job preferences. Likewise, machines can specify requirements and preferences about the jobs they are willing to run. These requirements and preferences can be described in powerful expressions, resulting in HTCondor’s adaptation to nearly any desired policy.
HTCondor can be used to build Grid-style computing environments that cross administrative boundaries. HTCondor’s “flocking” technology allows multiple HTCondor compute installations to work together. HTCondor incorporates many of the emerging Grid and Cloud-based computing methodologies and protocols. For instance, HTCondor-G is fully interoperable with resources managed by Globus.
HTCondor is the product of years of research by the Center for High Throughput Computing in the Department of Computer Sciences at the University of Wisconsin-Madison (UW-Madison), and it was first installed as a production system in the UW-Madison Department of Computer Sciences over 15 years ago. This HTCondor installation has since served as a major source of computing cycles to UW-Madison faculty and students. Additional HTCondor installations have been established over the years across our campus and the world. Hundreds of organizations in industry, government, and academia have used HTCondor to establish compute installations ranging in size from a handful to many thousands of workstations.
High Throughput Computing (HTC)
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, most scientists are concerned with how many floating point operations per month or per year they can extract from their computing environment rather than the number of such operations the environment can provide them per second or minute. Floating point operations per second (FLOPS) has been the yardstick used by most High Performance Computing (HPC) efforts to evaluate their systems. Little attention has been devoted by the computing community to environments that can deliver large amounts of processing capacity over long periods of time. We refer to such environments as High Throughput Computing (HTC) environments.
For more than a decade, the HTCondor team at the Computer Sciences Department at the University of Wisconsin-Madison has been developing and evaluating mechanisms and policies that support HTC on large collections of distributively owned heterogeneous computing resources. We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Flight Center i in July of 1996 and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.
The key to HTC is effective management and exploitation of all available computing resources. Since the computing needs of most scientists can be satisfied these days by commodity CPUs and memory, high efficiency is not playing a major role in a HTC environment. The main challenge a typical HTC environment faces is how to maximize the amount of resources accessible to its customers. Distributed ownership of computing resources is the major obstacle such an environment has to overcome in order to expand the pool of resources it can draw from. Recent trends in the cost/performance ratio of computer hardware have placed the control (ownership) over powerful computing resources in the hands of individuals and small groups. These distributed owners will be willing to include their resources in a HTC environment only after they are convinced that their needs will be addressed and their rights protected.