Norm Snyder
Contents:

Introduction
Why IBM is doing Linux Clusters
IBM Linux Cluster Offerings
    Hardware
    Software
    Systems Design and Specification
    Diagram of a large Linux Cluster
IBM eServer Cluster 1300
    Hardware
    Software
    Solution Design and Specification
    Future Generations
Advantages of IBM Linux Clusters
    IBM Solution Differentiators
    Systems Management
    RAS for IBM Components
    IBM Linux Cluster Packaging Advantage
Conclusion
Introduction
Clustered computing has been with us for several years. It represents an attempt to solve larger problems, or to solve problems in a more cost-effective manner, than the more conventional systems of the time. Greg Pfister, in his wonderful book In Search of Clusters, defines a cluster as a type of parallel or distributed system that consists of a collection of interconnected whole computers and is used as a single, unified computing resource.

Clusters have been devised, formally or informally, from many types of systems. At one extreme, the IBM System/390, now the zSeries, when configured in a Sysplex, represents a cluster. One could argue that an IBM zSeries running VM with tens or hundreds of Linux images might be considered a cluster as well. The IBM RS/6000 SP may be viewed as a cluster of RS/6000 servers (or RS/6000 server technology) packaged in a proprietary frame or rack. Both of these examples include a highly functional software and hardware interconnect that allows the cluster to be viewed and administered as a single system. I would assert that any cluster of interest should have this characteristic as well.

At the other end of the power spectrum, many organizations have assembled Intel (and other) servers into clusters of various types. Proprietary solutions such as Microsoft Wolfpack and Compaq Alpha TruCluster compete for mind share with generic Beowulf clusters on Linux Intel boxes. Beowulf, in fact, was the outgrowth of NASA researcher Donald Becker's solution to the problem of creating a supercomputing resource without having a supercomputing budget. It is the latter type of cluster that will be addressed in this paper.

One may also characterize clusters by their function:

- High Availability: redundancy and failover for fault tolerance.
- High Performance: many systems working together on a single problem. A FLOP farm.
- Server Consolidation: central management of resources dedicated to disparate tasks.

Initial efforts in Linux clustering have been in the High Performance Computing (HPC) area, and the forthcoming IA-64 boxes (with twice the floating-point capability) will continue to make this area vital. High Availability solutions for Linux clusters are now available from several ISVs and span a range of capability and complexity; HA clusters are expected to become common in the near future. Server consolidation solves a management problem, not a technical problem, and it should become important as Linux clusters become mainstream.
Why IBM is doing Linux Clusters

correctly implements the standard. Parallel programs using the MPI model can therefore run on Linux clusters. Tools and math libraries are available in the Open Source community, and middleware such as the General Parallel File System (GPFS) and DB2 has, or will have, Linux versions.

IBM's interest in Linux clusters on Intel architecture platforms is simple. Our customers are interested in Linux clusters on Intel, for all of the reasons mentioned above. IBM brings skills and knowledge that no other vendor can bring to the table, and is uniquely qualified to design and install large Linux super clusters, with all of the infrastructure that is needed to make them systems, as opposed to collections of independent boxes.

As parallel RISC systems did in HPC, Linux clusters are moving from the early adopter stage into the mainstream stage. This should be enabled by technologies such as high-performance file systems, high-availability software, and the like. As Linux cluster systems become mainstream, the business opportunities should increase proportionately.
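As a sketch of what the MPI programming model looks like in practice, the short C program below runs one process per processor, has each process report its rank, and combines a value on rank 0. It is illustrative only; any conforming MPI implementation (MPICH, for example) should build it with its compiler wrapper and launch it with the cluster's job launcher, and the specific reduction shown is an arbitrary example, not taken from this paper.

```c
/* Minimal MPI sketch: each process learns its rank; rank 0 gathers a sum.
 * Build with an MPI compiler wrapper (e.g. mpicc) and run with the
 * cluster's MPI launcher. Purely illustrative. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Every process contributes its rank; rank 0 receives the total. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d processes, sum of ranks = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}
```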
IBM Linux Cluster Offerings

Beginning in 2000, IBM has created Linux cluster systems from IBM xSeries (or the older IBM Netfinity) rack-mounted systems, integrating them with appropriate networks, a systems management layer (hardware and software), and the necessary services. The newly announced IBM eServer Cluster 1300 represents a formalization, or productization, of the previous custom-built cluster technology. The initial offerings consisted of custom-configured hardware and software to meet the customer's needs, coupled with appropriate services for custom installation and the necessary support. They were sold as special-bid systems, as opposed to formal products. The basic components of a Linux cluster, either a Cluster 1300 or a custom offering, are described below.

Hardware:

Rack-mounted Intel architecture systems are the basis of the Linux cluster. Dense packaging is a requirement, with standard 19-inch racks being the favored package. Within the racks we find nodes, fast interconnects such as switches or fabrics, network management hardware, terminal servers, and the like. Nodes may be functionally grouped into two categories: (1) compute nodes, which perform the computational problem for which the system is designed, and (2) infrastructure nodes, such as head nodes, management nodes, and storage nodes, which provide systems management and the specialized functions needed to make the collection of compute nodes into a system.

Compute nodes need to be as dense as possible and have few connectivity requirements. The inclusion of a service processor for systems management functions (described later) is essential. The IBM xSeries 330 is the standard compute node for the Cluster 1300. It allows up to two CPUs with robust memory and onboard disk in a 1U form factor. (A "U", or unit, is 1.75 inches of height in a standard 19-inch rack package.) The x330 has a built-in service processor and two slots for connectivity to the other components of the system.

Head nodes, management nodes, and storage nodes provide special functions for management of the cluster (such as boot support, hardware management, and external I/O). The IBM xSeries 342 is used as the management and storage node in the Cluster 1300.

Switches or other fabrics are used for inter-processor communication for parallel programming, and for various management functions. For parallel programming (Inter-Process Communication, or IPC), a commonly used switch is the Myrinet™ switch from Myricom. The Myrinet switch is a very fast, very scalable, high-bandwidth switch. Like all truly scalable switches, as the number of nodes attached to the switch increases, the aggregate bandwidth goes up proportionately and the latency remains constant. Said another way, the bandwidth on each path is the same, the number of paths depends on the number of nodes, and each node has a path to every other node regardless of the cluster size. Per-path bandwidth is approximately 200 MB/sec in each direction, with latency in the 6 to 8 microsecond range. Communication is user space to user space, and may be over IP or GM, Myricom's user-level software.
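Per-path figures like the bandwidth and latency quoted above are typically measured with a simple MPI ping-pong test between two nodes. The sketch below is a generic illustration, not an IBM or Myricom benchmark; the message size and repetition count are assumptions chosen for the bandwidth case, and a rerun with a tiny message approximates latency instead.

```c
/* Ping-pong sketch between ranks 0 and 1. With a large message the
 * round-trip time yields an estimate of per-path bandwidth; with a tiny
 * message, half the round trip approximates one-way latency. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define REPS 100

int main(int argc, char **argv)
{
    int rank, i, bytes = 1 << 20;       /* 1 MB payload (assumed size) */
    double t0, t1;
    char *buf;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {                /* send, then wait for the echo */
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {         /* echo the message back */
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / REPS;  /* seconds per round trip */
        printf("round trip %.1f us, about %.1f MB/s per direction\n",
               rtt * 1e6, (2.0 * bytes / rtt) / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```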
If the parallel programming environment requires less inter-processor communication, a less robust (and less expensive) interconnect fabric such as Ethernet might be substituted. GigaNet, Quadrics, SCI, or ServerNet might also be chosen for a custom offering, and InfiniBand may be viable in the future. Ethernet switches are used to construct an internal network for systems management and to interface with external networks; various switches from Cisco may be used, with Extreme Networks providing alternative solutions in a custom offering. Terminal servers provide remote access to the operating system consoles of the nodes through a serial network; Equinox terminal servers are typically used and are provided with the Cluster 1300. Additional functionality is added by a KVM (Keyboard, Video, Mouse) switch and C2T technology, which allow a single KVM to be associated with any node on the systems management network. External I/O, such as SCSI RAID devices, should typically also be in the racks with the nodes, switches, and other components.
Software:
The basic software for the Linux cluster is the Linux operating system. IBM supports the Red Hat distribution in the Cluster 1300, and may support other standard distributions through a custom offering. The Linux OS is installed on each node in the cluster; fortunately, for large clusters, the systems management software allows the installation to be cascaded through the management networks.

While Linux clusters are sometimes called Beowulf clusters, this is an oversimplification. Beowulf clustering software was the first real attempt to combine Intel architecture systems into a single entity, but Beowulf itself is not a single thing; rather, it is a collection of software that must be downloaded and integrated by the installer. Tutorials on Beowulf installation have reportedly expanded to fill two days. It is not simple.

A more mature systems management layer, used on the custom-solution Linux clusters (not including the Cluster 1300), is called xCAT. The following functions are supported by xCAT (a generic sketch of the parallel remote shell idea follows the list):

- Remote hardware and software reset (Ctrl+Alt+Del)
- Remote OS/POST/BIOS console
- Remote vitals (fan speed, temperature, etc.)
- Remote power control (on/off/state)
- Remote hardware event logs
- Remote hardware inventory
- Parallel remote shell
- Command line interface (no GUI)
- Single operations applied in parallel to multiple nodes
- Network installation (PXE)
- Support for various user-defined node types
- SNMP hardware alerts
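The parallel remote shell idea is straightforward: run the same command on many nodes at once and collect the results. The sketch below is not xCAT code; it is a generic C illustration, and the node names, the use of ssh, and the default command are assumptions made for the example.

```c
/* Parallel remote shell sketch: fork one child per node, exec a remote
 * shell command on each, then wait for all of them. Node names, the use
 * of ssh, and the default command are illustrative assumptions. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    const char *nodes[] = { "node001", "node002", "node003", "node004" };
    int n = sizeof(nodes) / sizeof(nodes[0]);
    const char *cmd = (argc > 1) ? argv[1] : "uptime";
    int i, status, failed = 0;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: run the command on one node over ssh. */
            execlp("ssh", "ssh", nodes[i], cmd, (char *)NULL);
            _exit(127);                 /* exec failed */
        }
    }

    /* Parent: wait for every child; a real tool would also tag and merge
     * each node's output so the administrator can read it in one place. */
    for (i = 0; i < n; i++)
        if (wait(&status) > 0 && (!WIFEXITED(status) || WEXITSTATUS(status) != 0))
            failed++;

    fprintf(stderr, "%d of %d node commands failed\n", failed, n);
    return failed ? 1 : 0;
}
```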
A recent ITSO Redbook details the installation considerations for an xCAT cluster. It is available at: http://publib-b.boulder.ibm.com/Redbooks.nsf/65f0d9cea6e0ab57852569e0007452bb/e0384f6e6982b28986256a0f005b7ba4?OpenDocument

Note that xCAT uses the xSeries service processor and that xCAT is not Open Source. It is supplied freely with custom-solution IBM Linux clusters as an as-is software component. It is written as scripts, so the source code is included.

A new IBM Licensed Program Product, Cluster Systems Management for Linux (CSM), provides systems management functions in a fashion similar to the systems management layer on the IBM RS/6000 SP (i.e., PSSP). CSM is the standard systems management software for the Cluster 1300. CSM for Linux includes technology derived from IBM Parallel System Support Programs for AIX. It is an offering equivalent to PSSP on AIX clusters, intended for customers who want robust cluster systems management on open, Intel processor-based servers.

Other software, both Open Source and proprietary, may be selected and tailored to the customer's needs, and may be installed as part of a complete system solution. Examples include the Portable Batch System (PBS) and the Maui Scheduler, both from Open Source, which provide sophisticated job and resource scheduling and accounting data. Other examples include MPICH for parallel programming, many math libraries, parallel debug and performance tools, and many ISV applications.
Systems Design and Specification:

Three functional networks must be included:

- A network for inter-process communication (IPC). Its speed is dependent on the problems to be addressed.
- A network for file I/O. If an IPC network is present, it might also serve as the I/O network.
- A network for systems management. Usually 10/100 Ethernet, it depends on cascaded topologies using head nodes, management nodes, and so on, connected by Ethernet switches. A terminal server must be included.

Network topologies, network addresses, and cable schemes (including cable lengths) must be addressed. While a rack will theoretically hold 36 or 42 1U boxes, we often leave space in the rack for expansion and/or to manage power and heat issues. Finally, the physical characteristics (power, heat, floor loading, service clearances, etc.) must be addressed. Traditional machine rooms with raised floors may well be needed, as may adequate conditioned power and adequate cooling, both in BTUs of heat removal and in cubic feet per minute of cool-air distribution.

The Cluster 1300, being a product as opposed to a custom solution, addresses all of these issues; best practices are automatically observed.
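As a back-of-the-envelope illustration of the power and cooling point, the short C program below converts an assumed per-node power draw into a rack-level heat load. The node count and the 200 W figure are assumptions made for the example, not measured x330 values; real planning should use measured draw.

```c
/* Back-of-the-envelope heat load for one rack of 1U nodes.
 * Node count and per-node wattage are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    int nodes = 32;                   /* 1U nodes, leaving spare rack space */
    double watts_per_node = 200.0;    /* assumed average draw per node */
    double kw = nodes * watts_per_node / 1000.0;
    double btu_per_hr = kw * 1000.0 * 3.412;   /* 1 W = 3.412 BTU/hr */

    printf("%d nodes: %.1f kW, about %.0f BTU/hr of cooling required\n",
           nodes, kw, btu_per_hr);
    return 0;
}
```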
Diagram of a large Linux Cluster:

[Figure: network diagram of a large Linux cluster, showing a 10/100 Mbit public network, 10/100/1000 Mbit backbone networks, and 100/1000 Mbit internal cluster links.]
IBM eServer Cluster 1300

The introduction of the Cluster 1300 product represents a formalization of the previous custom solutions.

Hardware:

The initial product is made up of x330 and x342 nodes in IBM standard 19-inch 36U racks. Switches (Myrinet) for inter-process communication are optional; Ethernet switches and terminal servers (Equinox) are included. This is conceptually the same as the custom-built solutions that were previously delivered.

A major difference is the integration. While the custom solutions may be integrated in any fulfillment center, or even on the customer's floor (not recommended), the product will be manufactured (i.e., integrated) and tested at the IBM plant in Beaverton, Oregon. Perhaps a more important effect of productization is the manufacturing infrastructure that comes as a consequence. The product has a single machine type and model; it is in fact a new machine type that designates a new server type, the IBM eServer Cluster 1300. Many advantages are realized from productization:

- The manufacturing facility will order and manage parts inventory, including non-IBM parts.
- Normal forecasting and supply/demand planning will follow naturally.
- Testing, including fully integrated system test, will be included.
- Machine histories will be known.
- MES orders can be entertained and fulfilled.
- Manufacturing considerations such as packaging, cooling, and emissions testing are resolved.

While many economies accrue from a manufactured product, such as lower costs and higher quality, not all issues are addressed with this approach. Requirements that fall outside the product boundaries may occur, and special bids will be entertained.
Software:
The software for the Linux cluster product is essentially that of the customized solution. The basic software stack includes the OS and management software common to almost all Linux clusters; for the Cluster 1300, this is the Red Hat Linux distribution and CSM. Specialized software, either Open Source or proprietary, may be included in the installation as part of an installation services contract.
A future objective is to be able to deliver application-specific software stacks. A commercial software stack, for example, might include WebSphere, DB2, MySQL, etc. An HPC stack might include MPICH, PVM, the Maui Scheduler, math libraries, compilers, and programming/profiling tools. Another application area that lends itself to this approach is bioinformatics. Many of these solutions are being validated on the Cluster 1300 at the time of announcement.

Solution Design and Specification:

The IBM eServer Cluster 1300 greatly simplifies the design task by offering standard configurations with optional features. A standard configuration tool is available for field use.
Future Generations:

It is hoped that Linux cluster products will follow using:

- IA-64 systems
- POWER systems
- Blade-architecture dense servers
- Hybrids
IA-64 systems could be offered in addition to the IA-32 systems described. As clusters, they should be architecturally similar to the IA-32 systems previously described. They should represent a different price/performance point than the traditional IA-32 systems, and should not immediately displace them. IA-64 systems, while more expensive, will have twice the floating-point performance of the IA-32 systems. They should therefore be immediately attractive for the numerically intensive applications in HPC and even in portfolio management. A few applications should have an immediate need for the 64-bit address space; over time, as more ISVs port their code to take advantage of the 64-bit architecture, it should become more generally attractive.

The POWER4™ processor, with its extreme performance in both floating-point and integer work and the parallelism inherent in its latest superscalar design, promises a platform that should outperform the competition in almost every metric. Excellent Linux performance should follow, and POWER4 Linux clusters should likely meet a market demand.

RLX Technologies (RLX) makes a blade-architecture system that offers extreme packing density and runs Linux. IBM has a reseller agreement to market RLX products. Hybrid solutions, with an IBM eServer pSeries™ rack-mounted system (or two, for HA) running the database and an Intel- or RLX-based server farm in the same frame, all running Linux, may be an attractive follow-on.
Advantages of IBM Linux Clusters

Systems management is considered the single most important attribute of a large cluster. While small clusters can manage with minimal systems management, the need can grow superlinearly with cluster size. Customers with experience in large systems, especially the RS/6000 SP, already understand this. Consider the tasks needed to manage a cluster:

- Each node must be booted.
- Software changes and fixes must be applied to each node.
- The cluster must have a place, preferably a single place, for users to log in, run compiles, access system resources, and so on.
- The systems administrator must have a way to control access to system resources (security) and to monitor and manage the health of the system.

There are two major components of systems management: the service processor and the software management layer.
The Service Processor: The xSeries systems used in our Linux clusters today (the x330 and x342) have an embedded service processor. It is included in the price of the system and does not occupy a slot. It allows remote power on/off, monitors hardware health, and provides alerts for various modes of failure detection. There is no standard for service processors; each vendor's implementation (if it exists) is unique.

The System Management Software: IBM Linux clusters, whether controlled by the historic xCAT or by CSM, have a very highly functional systems management layer. Both are dependent on the IBM service processor (sometimes called the SP, not to be confused with the RS/6000 SP), and neither would necessarily be expected to function correctly on another vendor's cluster; CSM requires the IBM eServer Cluster 1300 hardware. In both cases, functions include hardware control (power on/off and reset), software reset, OS/POST/BIOS console, vitals such as temperature and fan speed, hardware event logs, and hardware inventory, all available to the system administrator through a remote interface. These functions can also be monitored and can generate SNMP hardware alerts. In addition, both software packages provide network installation tools and the ability to apply single operations to multiple nodes in parallel.

While other vendors do provide some systems management function, our experience has been that their tools are less functional than xCAT or CSM. Both of the IBM solutions are architected for very large clusters; they are scalable, and scalability is a design criterion. Software not designed for scalability can perform exponentially worse as the system size increases. Our scalability is demonstrated by IBM's very large Linux clusters (1,024 nodes, so far).
RAS for IBM Components:

xSeries hardware includes features that make the individual nodes less likely to fail and that allow predictive maintenance where failures might occur. Built-in hardware redundancy for many functions allows the system to survive many failures gracefully.
Conclusion

Linux is increasingly popular and increasingly capable. Linux clusters represent a way to scale horizontally beyond the capabilities of any given Linux SMP implementation. The price/performance of the Intel platforms and the enhanced functions of the RISC platforms both provide attractive building blocks for Linux clusters.

IBM has many years of clustering experience on various platforms and brings much of this experience to bear in the Linux cluster environment. The components necessary to successfully build and manage large Linux clusters, or Linux super clusters, include dense packaging, built-in systems management using service processors and highly capable, scalable software, and strong systems integration skills.

IBM has built, and continues to build, the world's largest and most complex Linux clusters in response to our customers' requirements. The intellectual capital that allows successful creation and implementation of the largest clusters is being brought to bear on the emerging commercial market for Linux clusters of all sizes.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Microsoft, Excel, and Wolfpack are registered trademarks or trademarks of the Microsoft Corporation. Linux is a registered trademark of Linus Torvalds. Intel is a trademark of Intel Corporation in the United States and/or other countries. Other company, product, and service names may be trademarks or service marks of others.

IBM may not offer the products, programs, services, or features discussed herein in other countries, and the information may be subject to change without notice. General availability may vary by geography. IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594.

IBM hardware products are manufactured from new parts, or new and used parts. Regardless, our warranty terms apply. All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Any performance data contained in this document was determined in a controlled environment; results obtained in other operating environments may vary significantly. The information contained in this presentation has not been submitted to any formal IBM test and is distributed AS IS. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.