GRID AND CLOUD COMPUTING
Explore the architecture and usage of open-source grid middleware such as the Globus Toolkit (GT6), and distributed processing with Hadoop. Learn about grid resource management, middleware standards, and key components of grid middleware packages. Discover how these tools facilitate the sharing of heterogeneous resources and support virtual organizations across the grid.
GRID AND CLOUD COMPUTING: Globus Toolkit and Hadoop
Courtesy: Dr Gnanasekaran Thangavel
https://web.uettaxila.edu.pk/CMS/2022/SPR2022/teGNCCms/index.asp
UNIT 6: Globus Toolkit and Hadoop
Open-source grid middleware packages: Globus Toolkit (GT6) architecture, components/packages, and usage of Globus.
Distributed processing using Hadoop: introduction to Hadoop; MapReduce (input splitting, map and reduce functions, specifying input and output parameters, configuring and running a job); design of the Hadoop file system (HDFS concepts, command-line and Java interfaces, dataflow of file read and file write).
Open-source grid middleware packages
The Open Grid Forum (OGF) and the Object Management Group (OMG) are two well-established organizations behind various standards. Middleware is the software layer that connects software components; it lies between the operating system and the applications. Grid middleware is a specially designed layer between hardware and software that enables the sharing of heterogeneous resources and the management of the virtual organizations created around a grid. Popular grid middleware packages include:
1. BOINC - Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing community.
3. Globus (GT6) - A middleware library jointly developed by Argonne National Laboratory, the University of Chicago, and the USC Information Sciences Institute, funded by DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library developed by 20 top universities in China as part of the ChinaGrid Project.
Open-source grid middleware packages (continued)
5. Condor-G - Originally developed at the University of Wisconsin for general distributed computing, and later extended to Condor-G for grid job management.
6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for business grid applications; applied to private grids and local clusters within enterprises or campuses.
7. gLite - Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centers as part of the EGEE Project, gLite provides a framework for building grid applications that tap into the power of distributed computing and storage resources across the Internet.
The Globus Toolkit Architecture (GT6)
The Globus Toolkit is an open middleware library for the grid computing community. These open-source software libraries support many operational grids and their applications at an international level. The toolkit addresses common problems and issues related to grid resource discovery, management, communication, security, fault detection, and portability. The software itself provides a variety of components and capabilities, including a rich set of service implementations. The implemented software supports grid infrastructure management, provides tools for building new web services in Java, C, and Python, builds a powerful standards-based security infrastructure and client APIs (in different languages), and offers comprehensive command-line programs for accessing various grid services. The Globus Toolkit was initially motivated by a desire to remove obstacles that prevent seamless collaboration, and thus sharing of resources and services, in scientific and engineering applications. The shared resources can be computers, storage, data, services, networks, science instruments (e.g., sensors), and so on. The GT6 library is conceptually shown in the figure on the next slide.
(Figure: conceptual view of the GT6 library.) GT6 is binary-compatible with GT5 and GT5.2.
Grid Security Infrastructure (GSI)
Because grid resources and users are distributed and owned by different organizations, only authorized users should be allowed to access them, so an authentication infrastructure is needed. Also, both users and owners should be protected from each other. Users need assurance about the security of their data, code, and messages; GSI provides all of the above. GSI-C is the set of C-language libraries included in the Globus Toolkit; they are used to compile and patch OpenSSH for use with GSI (called GSI-OpenSSH).
MyProxy
MyProxy is an online repository of encrypted GSI credentials that provides authenticated retrieval of proxy credentials over the network. It improves usability: proxy credentials can be retrieved when and where they are needed, without managing private-key and certificate files. It improves security: long-term credentials are stored encrypted on a well-secured server. A sample proxy-initialization session:

% bin/grid-proxy-init
Your identity: /O=Grid/OU=Example/CN=Adeel Akram
Enter GRID pass phrase for this identity:
Creating proxy ................................. Done
Your proxy is valid until: Tue Oct 20 01:22:30 2022
Credential Accessibility with MyProxy
A MyProxy server can be deployed for a single user, a virtual organization, or a Certificate Authority (CA). Users can delegate proxy credentials to the MyProxy server for storage, and can store multiple credentials with different names, lifetimes, and access policies. They can then retrieve stored proxies when needed using the MyProxy client tools, and can allow trusted services to retrieve proxies, with no need to copy certificate and key files between machines.
GridFTP
GridFTP is an extension of the File Transfer Protocol (FTP) for grid computing. The protocol was defined within the GridFTP working group of the Open Grid Forum. There are multiple implementations of this protocol; the most widely used is the one provided by the Globus Toolkit.
RLS
The Replica Location Service (RLS) provides a framework for tracking the physical locations of data that has been replicated. At its simplest, RLS maps logical names (which don't include specific pathnames or storage-system information) to physical names (which do include storage-system addresses and specific pathnames).
GRAM5
The Globus Toolkit includes a set of service components collectively referred to as the Globus Resource Allocation Manager (GRAM). GRAM simplifies the use of remote systems by providing a single standard interface for requesting and using remote system resources for the execution of "jobs". The most common use (and the best-supported use) of GRAM is remote job submission and control, typically to support distributed computing applications.
Globus XIO
Globus XIO is an extensible input/output library written in C for the Globus Toolkit. It provides a single API (open/close/read/write) that supports multiple wire (communication) protocols, with protocol implementations encapsulated as drivers. The XIO drivers distributed with GT 6.0 include TCP, UDP, file, HTTP, GSI, GSSAPI_FTP, TELNET, and queuing. In addition, Globus XIO provides a driver-development interface for use by protocol developers.
The Globus Toolkit Architecture (figure)
The Globus Toolkit Architecture (continued)
GT6 offers the middle-level core services in grid applications. The high-level services and tools, such as MPI, Condor-G, and Nimrod/G, are developed by third parties for general-purpose distributed computing applications. The local services, such as LSF, TCP, Linux, and Condor, are at the bottom level and are fundamental tools supplied by other developers. As a de facto standard in grid middleware, GT6 is based on industry-standard web service technologies.
High-Level Services and Tools
- DRM (Distributed Resource Management): resource manager on the ASCI supercomputer
- Cactus: grid-aware numerical solver framework
- MPICH-G2: grid-enabled MPI
- globusrun: more complicated version of globus-job-run
- PUNCH: web-browser-based resource manager from Purdue University
- Nimrod/G: models computational jobs; from Monash University
- Grid Status: repository of the state of jobs in the grid
- Condor-G: Condor job-management layer on top of Globus
GT Core Services
- GASS (Globus Access to Secondary Storage): file and executable staging and I/O redirection
- GridFTP (Grid File Transfer Protocol): reliable, high-performance FTP
- MDS (Meta-computing Directory Service): maintains information about available resources
- GSI (Grid Security Infrastructure): authentication and authorization via proxies, delegation, PKI, SSL
- Replica Catalog: manages partial copies of the full data set across the grid
- GRAM (Grid Resource Allocation Management): allocation, reservation, monitoring, and control of programs on remote systems
- I/O Wrapper: TCP, UDP, IP multicast, and file I/O
GT Local Services
- Condor: job and resource manager for compute-intensive jobs
- MPI (Message Passing Interface): portability across platforms
- LSF (Load Sharing Facility): management of batch workload
- PBS (Portable Batch System): scheduling / resource management
- NQE (Network Queueing Environment): resource manager on Cray systems
Functionalities of GT4
- GRAM (Global Resource Allocation Manager): grid resource access and management (HTTP-based)
- Communication (Nexus): unicast and multicast communication
- GSI (Grid Security Infrastructure): authentication and related security services
- MDS (Monitoring and Discovery Service): distributed access to structure and state information
- Health and Status (HBM): heartbeat monitoring of system components
- GASS (Global Access of Secondary Storage): grid access to data in remote secondary storage
- GridFTP (Grid File Transfer): inter-node fast file transfer
Globus Job Workflow (figure)
Globus Job Workflow
A typical job execution sequence proceeds as follows:
1. The user delegates his credentials to a delegation service.
2. The user submits a job request to GRAM, with the delegation identifier as a parameter.
3. GRAM parses the request, retrieves the user's proxy certificate from the delegation service, and then acts on behalf of the user.
4. GRAM sends a transfer request to RFT (Reliable File Transfer), which applies GridFTP to bring in the necessary files.
5. GRAM invokes a local scheduler via a GRAM adapter, and the SEG (Scheduler Event Generator) initiates a set of user jobs.
6. The local scheduler reports the job state to the SEG.
7. Once the job is complete, GRAM uses RFT and GridFTP to stage out the resultant files.
8. The grid monitors the progress of these operations and sends the user a notification.
Client-Globus Interactions
There are strong interactions between provider programs and user code. GT6 makes heavy use of industry-standard web service protocols and mechanisms for service description, discovery, access, authentication, and authorization. User code for GT6 is typically written in Java, C, or Python. Web service mechanisms define specific interfaces for grid computing, and web services provide flexible, extensible, and widely adopted XML-based interfaces.
Data Management Using GT6
Grid applications often need to provide access to and/or integrate large quantities of data at multiple sites. The GT6 tools can be used individually or in conjunction with other tools to develop interesting solutions for efficient data access. The following list briefly introduces these GT6 tools:
1. GridFTP supports reliable, secure, and fast memory-to-memory and disk-to-disk data movement over high-bandwidth WANs. Based on the popular FTP protocol for Internet file transfer, GridFTP adds features such as parallel data transfer, third-party data transfer, and striped data transfer. In addition, GridFTP benefits from using the strong Globus Security Infrastructure for securing data channels with authentication and reusability. It has been reported that the grid has achieved 27 Gbit/second end-to-end transfer speeds over some WANs.
2. RFT provides reliable management of multiple GridFTP transfers. It has been used to orchestrate the transfer of millions of files among many sites simultaneously.
3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to information about the location of replicated files and data sets.
4. OGSA-DAI (Open Grid Services Architecture Data Access and Integration) tools were developed by the UK eScience program and provide access to relational and XML databases.
Introduction to Hadoop
Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop essentially provides two things:
- A distributed filesystem called HDFS (Hadoop Distributed File System)
- A framework and API for building and running MapReduce jobs
Hadoop is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware. Hadoop offers a software platform that was originally developed by the Yahoo! group. The package enables users to write and run applications over vast amounts of distributed data. Users can easily scale Hadoop to store and process petabytes of data in the web space. Hadoop is economical in that it comes with an open-source version of MapReduce that minimizes overhead in task spawning and massive data communication. It is efficient, as it processes data with a high degree of parallelism across a large number of commodity nodes, and it is reliable in that it automatically keeps multiple data copies to facilitate redeployment of computing tasks upon unexpected system failures.
Hadoop Distributed File System (HDFS)
HDFS is structured similarly to a regular Unix filesystem, except that data storage is distributed across several machines. It is not intended as a replacement for a regular filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and is optimized for throughput rather than latency.
HDFS Cluster Machines
There are two and a half types of machine in an HDFS cluster:
- Datanode: where HDFS actually stores the data; there are usually quite a few of these.
- Namenode: the master machine. It controls all the metadata for the cluster, e.g. which blocks make up a file, and which datanodes those blocks are stored on.
- Secondary Namenode: this is NOT a backup namenode, but a separate service that keeps a copy of both the edit logs and the filesystem image, merging them periodically to keep their size reasonable.
HDFS Cluster Machines (figure: NameNode, DataNodes, and Secondary NameNode)
Data can be accessed using either the Java API or the Hadoop command-line client. Many operations are similar to their Unix counterparts; a sketch of the Java API follows below.
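A minimal sketch of the Java interface, assuming the standard Hadoop client libraries on the classpath and a reachable cluster configured via core-site.xml; the file path is hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the cluster config.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (hypothetical path), overwriting if present.
        Path file = new Path("/tmp/hdfs-example.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }

        // List a directory: the programmatic analogue of `hadoop fs -ls`.
        for (FileStatus s : fs.listStatus(new Path("/tmp"))) {
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        }
    }
}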
Hadoop's Architecture
NameNode:
- Stores metadata for the files, like the directory structure of a typical filesystem.
- The server holding the NameNode instance is quite crucial, as there is only one.
- Keeps a transaction log for file deletes/adds, etc.; does not use transactions for whole blocks or file streams, only metadata.
- Handles creation of more replica blocks, when necessary, after a DataNode failure.
DataNode:
- Stores the actual data in HDFS.
- Can run on any underlying filesystem (ext3/4, NTFS, etc.).
- Notifies the NameNode of what blocks it has.
- The NameNode replicates blocks 2x in the local rack, 1x elsewhere.
Hadoop's Architecture
Hadoop is distributed, with some centralization. The main nodes of the cluster are where most of the computational power and storage of the system lies. Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and a DataNode to store needed blocks as closely as possible to the computation. A central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to TaskTrackers. Hadoop is written in Java, and also supports Python and Ruby.
Hadoop's Architecture
Hadoop Distributed Filesystem:
- Tailored to the needs of MapReduce
- Targeted toward many reads of file streams
- Writes are more costly
- High degree of data replication (3x by default); no need for RAID on normal nodes
- Large block size (64 MB)
- Location awareness of DataNodes in the network
HDFS Command Line

# list files in the root directory
hadoop fs -ls /
# list files in my home directory
hadoop fs -ls ./
# print the contents of a (compressed) file to stdout
hadoop fs -text ./file.txt.gz
# upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
HDFS Features
HDFS is optimized differently from a regular file system: it is designed for non-realtime applications demanding high throughput, rather than online applications demanding low latency. For example, files cannot be modified once written, and the latency of reads/writes is poor by filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so HDFS can handle workloads no single machine could ever sustain.
HDFS Features (continued)
- Failure tolerant: data can be duplicated across multiple datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3 (everything is stored on three machines); see the sketch after this list.
- Scalability: data transfers happen directly with the datanodes, so read/write capacity scales fairly well with the number of datanodes.
- Space: need more disk space? Just add more datanodes and re-balance.
- Industry standard: lots of other distributed applications build on top of HDFS (HBase, MapReduce).
- Optimized for MapReduce.
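As a minimal sketch using the standard Hadoop client API (the file path is hypothetical), the replication factor is a per-file property that can even be adjusted after a file is written:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/important.log"); // hypothetical file
        fs.setReplication(p, (short) 3);          // ask HDFS to keep three copies
        System.out.println("replication factor: "
                + fs.getFileStatus(p).getReplication());
    }
}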
Deployment of TaskTrackers with DataNodes (figure)
Hadoop's Architecture
MapReduce Engine: JobTracker and TaskTrackers
- The JobTracker splits up a job into smaller tasks (Map) and sends them to the TaskTracker process on each node.
- Each TaskTracker reports back to the JobTracker node on job progress, sends data (Reduce), or requests new jobs.
- None of these components is necessarily limited to using HDFS: many other distributed filesystems with quite different architectures work, and many other software packages besides Hadoop's MapReduce platform make use of HDFS.
MapReduce
The second fundamental part of Hadoop is the MapReduce layer, which is made up of two subcomponents:
- An API for writing MapReduce workflows in Java.
- A set of services for managing the execution of these workflows.
Map and Reduce APIs
The basic premise is this: Map tasks perform a transformation (distribution), and Reduce tasks perform an aggregation (collection); the classic word-count example below illustrates both.
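A minimal sketch of the two functions, written against the standard org.apache.hadoop.mapreduce API; the class names are my own, and the input is assumed to be plain text lines:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: transform each input line into (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: aggregate all counts emitted for a given word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}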
MapReduce Services
Hadoop MapReduce comes with two primary services for scheduling and running MapReduce jobs: the JobTracker (JT) and the TaskTrackers (TT). A driver program configures a job, specifies its input and output paths, and submits it to these services, as sketched below.
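A minimal driver sketch for configuring and running a job, assuming the mapper and reducer classes from the previous example; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the map and reduce functions; the reducer also
        // serves as a combiner for local pre-aggregation.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value types produced by the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are specified as parameters.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and block until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /input /output (jar and path names hypothetical).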
JT and TTs
The JT is the master: it is in charge of allocating tasks to TaskTrackers and scheduling these tasks globally. The TTs are in charge of running the Map and Reduce tasks themselves.
JT and TT Optimizations
Many things can go wrong in a big distributed system, so these services have some clever tricks to ensure that your job finishes successfully:
- Automatic retries: if a task fails, it is retried N times (usually 3) on different TaskTrackers.
- Data locality optimizations: if you co-locate a TT with an HDFS Datanode (which you should), it will take advantage of data locality to make reading the data faster.
- Blacklisting a bad TT: if the JT detects that a TT has too many failed tasks, it will blacklist it, and no further tasks will be scheduled on that TaskTracker.
- Speculative execution: the JT can schedule the same task to run on several machines at the same time, just in case some machines are slower than others. When one version finishes, the others are killed.
Several of these behaviors are tunable per job, as the sketch after this list shows.
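A minimal sketch of tuning these behaviors in the job configuration. Property names vary across Hadoop versions: the mapreduce.* names below are the newer ones, and older MRv1 releases used mapred.* equivalents, so treat them as assumptions to check against your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap automatic retries per map task (assumed property name).
        conf.setInt("mapreduce.map.maxattempts", 3);
        // Toggle speculative execution (assumed property names).
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "tuned job");
        System.out.println("configured: " + job.getJobName());
    }
}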
Hadoop Framework Tools (figure)
Hadoop in the Wild
Hadoop is in use at most organizations that handle big data: Yahoo!, Facebook, Amazon, Netflix, etc. Some examples of scale:
- Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search.
- Facebook's Hadoop cluster hosts 100+ PB of data (July 2012), growing at PB/day (Nov 2012).
Three Main Applications of Hadoop
- Advertising (mining user behavior to generate recommendations)
- Search (grouping related documents)
- Security (searching for uncommon patterns)
Hadoop Highlights
- Distributed file system
- Fault tolerance
- Open data format
- Flexible schema
- Queryable database
Why use Hadoop?
- Need to process multi-petabyte datasets
- Data may not have a strict schema
- Expensive to build reliability into each application
- Nodes fail every day
- Need common infrastructure
- Very large distributed file system
- Assumes commodity hardware
- Optimized for batch processing
- Runs on heterogeneous OS
DataNode: A Block Server
- Stores data in the local file system
- Stores metadata of a block (e.g. checksum)
- Serves data and metadata to clients
Block report:
- Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data:
- Forwards data to other specified DataNodes
Block Placement
Replication strategy:
- One replica on the local node
- Second replica on a remote rack
- Third replica on the same remote rack
- Additional replicas are randomly placed
Clients read from the nearest replica. The sketch after this list illustrates the rule.
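A conceptual sketch of that placement rule in plain Java; this is not the actual HDFS BlockPlacementPolicy implementation, and the Node type and cluster list are hypothetical stand-ins:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PlacementSketch {
    record Node(String name, String rack) {}

    // Choose replica targets for one block, following the rule above.
    static List<Node> chooseReplicas(Node writer, List<Node> cluster, int factor) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer); // replica 1: the local (writer's) node
        // Replica 2: any node on a different rack.
        Node remote = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack()))
                .findFirst().orElseThrow();
        chosen.add(remote);
        // Replica 3: another node on that same remote rack.
        cluster.stream()
                .filter(n -> n.rack().equals(remote.rack()) && !n.equals(remote))
                .findFirst().ifPresent(chosen::add);
        // Additional replicas: placed randomly on remaining nodes.
        List<Node> shuffled = new ArrayList<>(cluster);
        Collections.shuffle(shuffled);
        for (Node candidate : shuffled) {
            if (chosen.size() >= factor) break;
            if (!chosen.contains(candidate)) chosen.add(candidate);
        }
        return chosen;
    }
}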
Data Correctness
HDFS uses checksums (CRC32) to validate data.
- File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums.
- File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas.
A sketch of the idea follows.
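A conceptual illustration of per-chunk CRC32 checksumming in plain Java; the chunk size matches the 512 bytes above, but the API is illustrative rather than HDFS's internal implementation:

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumSketch {
    static final int CHUNK = 512; // bytes per checksum, as in HDFS

    // Compute one CRC32 per 512-byte chunk (done at write time).
    static long[] checksums(byte[] data) {
        int chunks = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * CHUNK;
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Re-verify on read; a mismatch would send the client to another replica.
    static boolean verify(byte[] data, long[] expected) {
        return Arrays.equals(checksums(data), expected);
    }
}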
Data Pipelining
- The client retrieves a list of DataNodes on which to place replicas of a block.
- The client writes the block to the first DataNode.
- The first DataNode forwards the data to the next DataNode in the pipeline.
- When all replicas are written, the client moves on to write the next block in the file.
MapReduce Usage
- Log processing
- Web search indexing
- Ad-hoc queries