Distributed Computing

The Importance of Networking

Integration of computer systems and other electronic components has long been important for both local and remote users.

As noted in CERN Courier, Aug'1973, an important reason for adopting a distributed computing philosophy is to avoid the need to build large specialised systems for particular experiments which would be expensive in terms of equipment and man hours. The network can also pump data directly into the main computer and for later storage on tape rather than moving tapes from one place to another. Stored data can then be made available across the network for different purposes without duplication. In the 1970s, the Daresbury network was designed to enable many users simultaneous access to equipment without interference between them. Local computers and specialist software are however kept to a minimum and there is no local data storage. Everything was connected to the central computer using twelve high speed links of 12 Mbits/s, multiplexed by the front end IBM 1802. Local equipment comprises of PDP, ARGUS, IBM, etc. computers plus multiple CAMAC units. This network was installed in 1968 and was designed to be easily extended as necessary.

This is still very much the philosophy for grid computing today.

Figure 18: Daresbury Network in 1973.
Image cern_courier-8-73

In the 1970s all interfaces were built to the CAMAC, Computer Automated Measurement And Control standard. CAMAC was a modular system originally developed by ESONE: European Standards and Nuclear Electronics under a collaboration of several European nuclear laboratories. CAMAC modules enabled rapid design and build of digital electronic control systems. Up to 600 slim rack mounted modules were available built to internationally agreed standards and available from more than one manufacturer. It was found to be ideal for computer control in many industrial and scientific fields. It was demonstrated at the COMPEC'74 computer peripheral show in London. Daresbury staff provided a display including a PDP-11 for the CAMAC Association stand at the show.

Some more photos and information about CAMAC and associated systems on the SRS can be found here.

As noted above, in 1974 in addition to local terminals at Daresbury a PDP-11 supported remote job entry (RJE) stations at Daresbury, Lancaster, Liverpool, Manchester and Sheffield. There was also a connection over the SRCNet packet switching network to a GEC 4080 computer fulfilling a similar function at the Rutherford site.

These networks used packet switched protocols in which data are bundled up into ``datagrams'', blocks of a few kBytes each having extra information about the sending and receiving addresses. Daresbury had one of the eight main packet switched exchanges and was a Network Operations Centre. Protocol for handling datagrams is very sophisticated; individual packets may not necessarily arrive in the order they were sent, perhaps taking different length paths through the network if traffic is high. Each datagram is sent separately, as in a postal system rather than establishing a fixed connection such as through a telephone exchange. Packets may be lost and need to be re-sent. However, error free communication is possible using sophisticated networking software developed over the last 15 years. Our use of computers, and much of our modern way of life, now depends on this capability (e.g. in banks, libraries, mail order, credit cards).

Experiments with ethernet local area network started at Daresbury as early as Nov'1981. Dr. Brian Davies initiated an evaluation of ethernet in comparison to the currently used rival Cambridge Ring network with the purchase of £20k worth of equipment. This ethernet to connect four work stations was purchased from Thames Systems. At the time there were several different ethernet standards - the Ungermann-Bass system was chosen as it was the one supported by Xerox.

Access to the distributed UNIX programming environment is now available to the academic community through the Joint Academic Network (JANET). This was established in 1984 and preceded similar European initiatives. Almost all HEIs in the UK are now registered on this network, together with some commercial companies. JANET had a link to the main European and North American IBM network (EARN/ BITNET) in the late 1980s, which also provided a world wide electronic mail and data transfer facility, including Japan and Australia.

These services have now largely been replaced by or extended to include networks built upon the internationally standard TCP/IP (Internet Protocol) communications similar to the campus wide Ethernet.

Connection to JANET from major nodes in 1994 was ``megastream'' based at 2 Mbits/s. Connections to most academic establishments was via ``kilostream'' with an upper limit of 64 kbits/s. The SuperJANET project was expected to supply 34 Mbits/s connection to participating sites, this technology allowing the SuperJIPS project to proceed. The JANET Internet Protocol Service (JIPS) has successfully provided IP services to the academic community.

Since the protocols in use on local area networks are the same, access to a similar bandwidth, but nationally, offers many exciting possibilities. These include national workstation clustering and parallel execution, interactive visualisation of programs running at the Supercomputing centres using local graphics resources, and uses of remote systems particularly suited to certain types of computation by use of Remote Procedure Calls (RPC). This is now beginning to be realised on grids through the Grid and Web Services.

High level programming software, which encompasses resources on a network, has been developed at Daresbury since the mid 1980s to provide a platform for writing scientific programs and visualising results. We also need to make a single program run efficiently on those compute servers with multiple processors, such as the Intel hypercube and workstation clusters, by dividing it into small pieces which exchange data. Our strategy is therefore to treat the whole environment as one large parallel computer with data passing between the different machines. This is possible with special software to connect Ethernet addresses, with sockets at either end to send or receive data across a link. This is similar to a telephone system with one computer calling another and talking to it when the call is answered. Of course one of the difficulties of parallel computing is when a call isn't answered, but let us assume that this does not happen very often.

One solution is a harness program which can be invoked from any workstation. It effectively logs into other machines and starts one, or several tasks running and passes data to them. Such harnesses can have an interface standard to all computers needed. At Daresbury a harness called FORTNET (Fortran for Networks) was developed [59]. The only change in moving from one machine to another, apart from speed, is currently the need to recompile the source program. It is even possible to use the same program on a single parallel computer such as the Intel, Meiko, HPCx or a NW-GRID cluster in this portable way after developing it on a workstation cluster. Worldwide activity in message passing software for scientific computing has led to the MPI: Message Passing Interface standard which is now widely used.

The UNIX revolution, 1988-1994

The following text is based on that from the article in Physics World [5].

The change from a single AS/7000 mainframe to a Convex C220 running UNIX at Daresbury in 1988 not only meant a 16 fold increase in CPU performance and a 16 fold increase in main memory, but reflected the beginning of the mainstream trend towards distributed computing. Although computational scientists had access to the Cray X-MP at RAL, the differences between the Cray and AS/7000 made software development awkward. The arrival of UNIX was a watershed for Daresbury, enabling the subsequent rapid development of a powerful distributed computing network including super workstations and parallel clusters.

The benefits of this kind of distributed computing are considerable, and qualitatively change the way scientists approach computational problems. Firstly it is a flexible way of providing computer power responsive both to the rapidly changing hardware market place and crucially to the changing requirements of users' scientific applications. New components can be added simply, or upgraded as necessary. All the systems on our network were acquired according to particular needs of the scientific community which we support.

The use of specialised machines can be a spectacularly cost effective route to high performance. The current generation of distributed memory parallel machines can for instance equal and often out perform conventional vector supercomputers at approximately one tenth of the cost. Therefore we have a number of these on our network to provide high performance floating point arithmetic. However our experience has shown that any specialised system including a parallel one is most useful to the scientist when integrated with others of complementary functionality. These facilities must also be easily accessible.

The second feature of distributed computing is the integration of the different servers into one system via the network.

Thirdly users can exploit a kind of "network parallelism" - several machines in a cluster share parts of a single task and work concurrently. Integrating these aspects allows one to use the machines in a complementary was on a single application. For example computationally intensive parts of a problem can be "spawned" to a high performance parallel or vector machine from a graphical workstation environment.

The whole question of supercomputer provision is under continual review. A meeting held at the Royal Institution of Great Britain on 24th September 1992 examined the needs within the UK for further large scale computing resources to tackle both present and future research [16]. Some of Daresbury's contributions to this provision were highlighted in the Daresbury Laboratory Annual Report.

A trend internationally in supercomputing activities is the rapid growth in ``grand challenge'' applications or projects. Such projects are characterised by: (i) a large number of collaborating groups, drawn from academia, research laboratories and industry; and (ii) an insatiable appetite for supercomputer cycles. The latter attribute is readily explained by the nature of the problem areas under investigation - examples include: the tracking of pollutants and pollution control in air, ground water and the ocean; pharmaceutical drug design; chemical reaction and catalysis; structural engineering with applications in, for instance, the building and explosives industries; whole vehicle design and modelling in the aerospace and car industries; and many others. Supercomputers are vital in all these projects, not only to test new theories to high accuracy against the most exact experimental data (e.g. from the Synchrotron Radiation Source), but to enable integration in industrial design, modelling for verification to assess costs and side effects and for optimisation of process risks and costs.

The theme of international collaboration in high performance computing underpins recent programmes from the European Commission, who have announced funding for projects to port a number of major codes to parallel systems driven by consortia involving supercomputer centres, academic establishments and industry.

Many of the software techniques that are underpinning advances in the ``grand challenge'' arena are also being brought to bear on a new generation of powerful workstations, which may be used by individual researchers and also clustered to support parallel computation and sharing of data via a local area network. Through the availability of such resources, funded in many instances by the Science Board's Computational Science Initiative (CSI), more researchers can benefit from the software developed and maintained at Daresbury under the auspices of the Collaborative Computational Projects (CCPs).

Computational science of world class status needs commensurate computing resources, and both the SRS and CCP communities make extensive used of the national supercomputing facilities at the Universities of London (Cray X-MP/28 and Convex C380) and Manchester (Amdahl VP1200) and the Joint Research Council's Cray X-MP/416 at the Rutherford Appleton Laboratory The 64 node Intel iPSC/860 at Daresbury and the 256p rocessor Cray T3D at the Edinburgh Parallel Computing Centre. Similar facilities are provided at Orsay in France (by the CNRS) several sites in Germany (by the DFG and local Länder) and also in the Netherlands and Italy. These centres provide powerful processors and considerable memory, disk, mass storage and user support resources.

However high performance computing also requires visualisation facilities, a range of programming tools and ease of access and availablity. It has long been realised that these supercomputing centres must be supplemented and complemented by quite powerful local facilities in university departments and research centres. This is also true in industry.

In 1986 therefore the SERC established the Computational Science Initiative (CSI) to provide distributed high performance computing for the biological science, chemistry, physics and mathematics communities. The initiative is now in its 7th year and was renamed the Distributed Computing Programme (DisCo) in 1993. It has provided hardware, software and maintenance support as well as staff and studentships to almost 100 research groups in these fields and also to materials researchers and users of the major synchrotron neutron beam and laser facilities at DL and RAL.

Although we only describe our own experiences many other laboratories in Europe and the USA have distributed systems similar to that at Daresbury which therefore should only be considered as an example. In Britain and the rest of Europe many higher education institutions (HEIs) now have their own campus wide networks with workstations poviding access to shared resources. One long established example is NUNET linking Durham and Newcastle Universities. In the late 1970s this comprised two IBM mainframes and distributed terminals - it now has a wide assortment of machines including parallel computers and a dedicated cluster of fifteen IBM RS/6000 workstations. Examples in the USA include the national laboratories such as Argonne, NASA Ames, Oak Ridge and Lawrence Livermore. Worldwide collaboration between scientists is helped enormously by having a pool of applications programs which can be used at key sites within a standard software environment.

The adoption of distributed computing by so many different groups aided by the UNIX operating system and software is indicative of its success. Indeed the widespread acceptance of UNIX and the international standards POSIX and OSF/1 related to UNIX are a crucial issue. Distributed computing and networking encourages interaction between scientists and their software and creates a driving force behind standardisation of hardware interconnection, operating systems, graphical interfaces and computer languages. Much free software is now available in this standard environment and the unrestricted access to information is underlined by the creation of the World Wide Web, by which any UNIX computer user can interrogate repositories of information through an interactive window interface at any registered site in the world. Daresbury is one of these registered sites.

We now consider in more detail the way services are provided for the use by the different components of our distributed computing system, outlining the rôles of the network the parallel processing compute servers and graphical workstations.

The key elements in the Daresbury local area network (LAN) are shown in Figure 19. There are "compute servers", high-performance computers for numerical tasks, such as the Intel and Meiko parallel machines, "graphic servers", optimised computers for interactive visualisation and control, the Apollo and Silicon Graphics superworkstations, "file servers", multi-user computers with high storage capacity and good input/output bandwidth also including access to e-mail and network facilities, such as the Convex and Sun servers and netstor, and "clusters" such as the six IBM RS/6000 systems or the five HP series 700 systems with their own high speed interconnetion.

Figure 19: Small part of Daresbury LAN in 1994
Image network

The complete LAN is made up of a number of interconnected Ethernet networks supporting a standard protocol which enables each computer to be identified with a unique name and Ethernet address. There were in 1994 about 100 Sun and Silicon Graphics office workstations as well as IBM and Apple/ Mackintosh PCs. The heavy traffic between all these and the file servers is divided onto a number of branches by bridges from the central ``spine''. Software provided with the UNIX operating system can provide the user almost transparent access to both data storage and shared software. This avoids duplication and facilitates maintenance, upgrades and access to all shared resources. Remote log in to any machine on the network is possible with the correct password. Other software supports access to facilities on a remote machine through interactive windows on a workstation.

We note that by 2000 the use of vendor specific flavours of Unix, such as AIX (IBM), Irix (SGI), Solaris (Sun), Ultrix and True64 (DEC) had mostly been replaced by Linux, the widely used open source alternative.

The Future?

What are the problems associated with distributed computing? It is true that since users have more control over their computing, they also have more responsibility. They certainly need to know more about the hardware and the operating system, as well as their own code to get optimal performance. There is a danger of turning scientists into systems analysts or operators. It is true also that the day-to-day running of the machines, backing up, trouble shooting, etc. needs to be clearly organised - this side of operations is rather different for a distributed system than for one based on a centralised mainframe and more flexibility is needed generally requiring a larger number of dedicated staff.

The most critical element in a distributed system is the network itself. The network must perform well at all times, and a good deal of consideration must be given to its configuration and upgrading as the systems linked to it evolve. This activity also required a high complement of dedicated staff. Current networks become overloaded quickly. As with processors there will be great advances in network bandwidth over the next few years. Huge commercial as well as academic interest is already being shown in the potential offered by video conferencing and integrated media and retrieval facilities. This will give a vital boost for distributed computing. We note that since some of the above text was written in the 1990s, SuperJANET5 has been deployed and is connecting core sites with fibre network of up to 40 Gb/s.

If the potential difficulties are avoided, a distributed computing environment can be very productive for the computational scientist. It offers cost effective and high performance computing complementary to that provided at supercomputer centres, and offers it in a way which is responsive to the user's research needs with large amounts of compute power under their control. Certainly at Daresbury distributed computing has allowed us to operate at a much higher level than could be envisaged in the mid 1980s. We believe research computing will used distributed computing systems even more in the future.

Figure 20: Computing Scales in North West England
Image computing_scales

The above conclusions were written in 1994, but remain largely un-changed today. Our technology has changed, but our ambitions have not!

Rob Allan 2024-04-17