Paper Learning: Data-Intensive Supercomputing: The case for DISC

Recently, I have been studying DISC, whose inspiration comes from the server infrastructure that Google developed to support search over the worldwide web. After reading Data-Intensive Supercomputing: The case for DISC, I believe we can turn the idea of constructing a Google-like infrastructure into reality; that reality is DISC.

DISC can be developed as a prototype of Google's infrastructure. We can divide it into two types of partitions: one for application development, and the other for systems research.
For the program development partitions, we can use available software, such as the open source code from the Hadoop project, to implement the file system and support for application programming.

For the systems research partitions, we can create our own design, studying the different kinds of design points (e.g., high-end hardware vs. low-cost components).


The paper Data-Intensive Supercomputing: The case for DISC gives me an overall impression of a new form of high-performance computing facility, and many other aspects of it deeply attract me. My notes on the paper follow:



闃呰Paper錛?/span>

Data-Intensive Supercomputing: The case for DISC  

Randal E. Bryant  May 10, 2007 CMU-CS-07-128

 

Question: How can university researchers demonstrate the credibility of their work without having comparable computing facilities available?

1 Background

Describe a new form of high-performance computing facility (Data-Intensive Super Computer) that places emphasis on data, rather than raw computation, as the core focus of the system.

The author's inspiration for DISC comes from the server infrastructures that have been developed to support search over the worldwide web.

This paper outlines the case for DISC as an important direction for large-scale computing systems.

1.1 Motivation

Example computations in which data plays the common, central role:

Web search without language barriers (no matter which language the query is typed in).

Inferring biological function from genomic sequences.

Predicting and modeling the effects of earthquakes.

Discovering new astronomical phenomena from telescope imagery data.

Synthesizing realistic graphic animations.

Understanding the spatial and temporal patterns of brain behavior based on MRI data.


2 Data-Intensive Super Computing

Conventional (Current) supercomputers:

are evaluated largely on the number of arithmetic operations they can supply each second to the application programs.

Advantage: well suited to applications in which highly structured data requires large amounts of computation.

Disadvantage:

1. It creates misguided priorities in the way these machines are designed, programmed, and operated;

2. It disregards the importance of incorporating computation-proximate, fast-access data storage, while at the same time creating machines that are very difficult to program effectively;

3. The range of computational styles is restricted by the system structure.

The key principles of DISC:

1. Intrinsic, rather than extrinsic, data.

2. High-level programming models for expressing computations over the data.

3. Interactive access.

4. Scalable mechanisms to ensure high reliability and availability (error detection and handling).



3 Comparison to Other Large-Scale Computer Systems

3.1 Current Supercomputers

3.2 Transaction Processing Systems

3.3 Grid Systems



4 Google: A DISC Case Study

1. The Google system actively maintains cached copies of every document it can find on the Internet.

The system constructs complex index structures, summarizing information about the documents in forms that enable rapid identification of the documents most relevant to a particular query.

When a user submits a query, the front end servers direct the query to one of the clusters, where several hundred processors work together to determine the best matching documents based on the index structures. The system then retrieves the documents from their cached locations, creates brief summaries of the documents, orders them with the most relevant documents first, and determines which sponsored links should be placed on the page.
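To make the index-lookup step concrete, here is a minimal sketch of an inverted index with conjunctive (AND) queries. It is purely illustrative, assuming simple whitespace tokenization; it is not Google's actual index structure, and the documents below are made up.

```python
# Minimal illustrative sketch of an inverted index with conjunctive
# (AND) queries; not Google's actual index structure.
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical cached documents.
docs = {
    1: "data intensive super computing",
    2: "web search over massive data",
    3: "synthesizing realistic graphics animations",
}
index = build_index(docs)
print(search(index, "data computing"))  # -> {1}
```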

2. The Google hardware design is based on a philosophy of using components that emphasize low cost and low power over raw speed and reliability. Google keeps the hardware as simple as possible.

They make extensive use of redundancy and software-based reliability.

Failed components are removed and replaced without turning the system off.

Google has significantly lower operating costs in terms of power consumption and human labor than do other data centers.

3. MapReduce: a programming framework that supports powerful forms of computation performed in parallel over large amounts of data.

Two functions: a map function that generates values and associated keys from each document, and a reduce function that describes how all the data matching each key should be combined.

MapReduce can be used to compute statistics about documents, to create the index structures used by the search engine, and to implement their PageRank algorithm for quantifying the relative importance of different web documents.
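As a minimal sketch of the model, here is the classic word-count example run in memory on a single machine; it imitates the map / group-by-key / reduce phases but is of course not Google's distributed implementation.

```python
# Minimal single-machine sketch of the MapReduce model (word count);
# imitates the map / shuffle / reduce phases, not the real
# distributed implementation.
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit a (key, value) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: combine all values emitted under the same key.
    return key, sum(values)

def map_reduce(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)   # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {1: "data intensive computing", 2: "intensive data analysis"}
print(map_reduce(docs))
# {'data': 2, 'intensive': 2, 'computing': 1, 'analysis': 1}
```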

4. BigTable: a distributed data structure that provides capabilities similar to those seen in database systems.
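The BigTable paper describes the data model as a sparse, distributed, sorted map from (row key, column key, timestamp) to value. The toy class below mimics only that model on a single machine; the class and method names are my own, not BigTable's API.

```python
# Toy, single-machine mock of BigTable's data model: a sparse map from
# (row, column, timestamp) to value, keeping multiple timestamped
# versions per cell. Names here are invented, not BigTable's API.
import time

class ToyTable:
    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the most recent value in the cell, or None."""
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

t = ToyTable()
t.put("com.example/index.html", "contents:", "<html>v1</html>", 1)
t.put("com.example/index.html", "contents:", "<html>v2</html>", 2)
print(t.get("com.example/index.html", "contents:"))  # <html>v2</html>
```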


5 Possible Usage Model

The DISC operations could include user-specified functions in the style of Google’s MapReduce programming framework. As with databases, different users will be given different authority over what operations can be performed and what modifications can be made.

 

6 Constructing a General-Purpose DISC System

The open source project Hadoop implements capabilities similar to the Google file system and support for MapReduce.

Constructing a General-Purpose DISC System:

Hardware Design.

There is a wide range of choices;

We need to understand the tradeoffs between the different hardware configurations and how well the system performs on different applications.

Google has made a compelling case for sticking with low-end nodes for web search applications, and the Google approach requires much more complex system software to overcome the limited performance and reliability of the components. But it might not be the most cost-effective solution for a smaller operation when personnel costs are considered.

Programming Model.

1. One important software concept for scaling parallel computing beyond 100 or so processors is to incorporate error detection and recovery into the runtime system and to isolate programmers from both transient and permanent failures as much as possible.

Work on providing fault tolerance in a manner invisible to the application programmer started in the context of grid-style computing, but only with the advent of MapReduce and in recent work by Microsoft has it become recognized as an important capability for parallel systems.

2. We want programming models that dynamically adapt to the available resources and that perform well in a more asynchronous execution environment.

e.g.: Google’s implementation of MapReduce partitions a computation into a number of map and reduce tasks that are then scheduled dynamically onto a number of “worker” processors.
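A minimal sketch of both ideas together (tasks scheduled dynamically onto workers, with failed tasks re-queued so the programmer never sees the failure) might look like the following; the failures are simulated, and none of these names come from a real framework.

```python
# Minimal sketch of dynamic scheduling with transparent fault
# tolerance: workers pull tasks from a shared queue, and a task whose
# execution "fails" is simply re-queued and retried elsewhere.
# Failures are simulated; no real framework's API is used here.
import queue
import random
import threading

tasks = queue.Queue()
results = {}
results_lock = threading.Lock()

def worker():
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return                       # no work left; worker exits
        if random.random() < 0.2:        # simulate a transient failure
            tasks.put(task)              # reschedule; caller never sees it
            continue
        with results_lock:
            results[task] = task * task  # stand-in for real computation

for t in range(20):
    tasks.put(t)
threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(results), "tasks completed")   # 20 tasks completed
```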

Resource Management.

Problem: how to manage the computing and storage resources of a DISC system.

We want it to be available in an interactive mode and yet able to handle very large-scale computing tasks.

Supporting Program Development.

Developing parallel programs is difficult, both in terms of correctness and in terms of achieving good performance.

As a consequence, we must provide software development tools that allow correct programs to be written easily, while also enabling more detailed monitoring, analysis, and optimization of program performance.

System Software.

System software is required for a variety of tasks, including fault diagnosis and isolation, system resource control, and data migration and replication.

 

Google and its competitors provide an existence proof that DISC systems can be implemented using available technology. Some additional topics include:

How should the processors be designed for use in cluster machines?

How can we effectively support different scientific communities in their data management and applications?

Can we radically reduce the energy requirements for large-scale systems?

How do we build large-scale computing systems with an appropriate balance of performance and cost?

How can very large systems be constructed given the realities of component failures and repair times?

Can we support a mix of computationally intensive jobs with ones requiring interactive response?

How do we control access to the system while enabling sharing?

Can we deal with bad or unavailable data in a systematic way?

Can high performance systems be built from heterogenous components?


7 Turning Ideas into Reality

7.1 Developing a Prototype System

Operate two types of partitions: some for application development, focusing on gaining experience with the different programming techniques, and others for systems research, studying fundamental issues in system design.

For the program development partitions:

Use available software, such as the open source code from the Hadoop project, to implement the file system and support for application programming.

For the systems research partitions:

Create our own design, studying the different layers of hardware and system software required to get high performance and reliability (e.g., high-end hardware vs. low-cost components).

7.2 Jump Starting

Begin application development by renting much of the required computing infrastructure:

1. network-accessible storage: Simple Storage Service (S3)

2. computing cycles: Elastic Compute Cloud (EC2)

(The current pricing for storage is $0.15 per gigabyte per month ($1,800 per terabyte per year), with additional costs for reading or writing the data. Computing cycles cost $0.10 per CPU hour ($877 per year) on a virtual Linux machine.)
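The annualized numbers in parentheses follow directly from the monthly and hourly rates; a quick sanity check:

```python
# Sanity check of the annualized 2007 AWS figures quoted above.
storage_usd_per_gb_month = 0.15
print(storage_usd_per_gb_month * 1000 * 12)   # 1800.0 USD per TB-year

cpu_usd_per_hour = 0.10
print(round(cpu_usd_per_hour * 24 * 365.25))  # 877 USD per CPU-year
```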

Renting problems:

1. The performance of such a configuration is much less than that of a dedicated facility.

2. There is no way to ensure that the S3 data and the EC2 processors will be in close enough proximity to provide high speed access.

3. We would lose the opportunity to design, evaluate, and refine our own system.

7.3 Scaling Up


8 Conclusion

1. We believe that DISC systems could change the face of scientific research worldwide.

2. DISC will help realize the potential of all this data: the combination of sensors and networks to collect it, inexpensive disks to store it, and the benefits derived from analyzing it.

 



DISC (Data-Intensive Super Computing)

Data Intensive System (DIS)

System Challenges:

Data distributed over many disks

Compute using many processors

Connected by gigabit Ethernet (or equivalent)

System Requirements:

Lots of disks

Lots of processors

Located in close proximity

System Comparison:

(i) Data

Conventional Supercomputers:

Data stored in a separate repository, with no support for collection or management.

Data must be brought into the system for computation, which is time consuming and limits interactivity.

DISC:

System collects and maintains the data as a shared, active data set.

Computation is colocated with storage, giving faster access.

(ii) Programming Models

Conventional Supercomputers:

Programs described at a very low level, specifying detailed control of processing & communications.

Rely on a small number of software packages written by specialists, which limits the classes of problems & solution methods.

DISC:

Application programs written in terms of high-level operations on data.

Runtime system controls scheduling, load balancing, …

(iii) Interaction

Conventional Supercomputers:

Main machine offers batch access only; the priority is to conserve machine resources. The user submits a job with specific resource requirements, and it runs in batch mode when resources become available.

Offline visualization: results are moved to a separate facility for interactive use.

DISC:

Interactive access; the priority is to conserve human resources. User actions can range from a simple query to a complex computation.

System supports many simultaneous users, which requires a flexible programming and runtime environment.

(iv) Reliability

Conventional Supercomputers:

“Brittle” systems: the main recovery mechanism is to recompute from the most recent checkpoint, and the system must be brought down for diagnosis, repair, or upgrades.

DISC:

Flexible error detection and recovery: the runtime system detects and diagnoses errors, with selective use of redundancy and dynamic recomputation.

Components can be replaced or upgraded while the system is running, which requires a flexible programming model & runtime environment.

Comparison with Grid Computing:

Grid: Distribute Computing and Data

(i) Computation: distribute the problem across many machines, though generally only problems that partition easily into independent subproblems.

(ii) Data: support shared access to large-scale data sets.

DISC: Centralize Computing and Data

(i) Enables more demanding computational tasks.

(ii) Reduces the time required to get data to machines.

(iii) Enables more flexible resource management.

A Commercial DISC

Netezza Performance Server (NPS)

Designed for “data warehouse” applications

Heavy-duty analysis of databases

Data distributed over up to 500 Snippet Processing Units

Disk storage, dedicated processor, FPGA controller

User “programs” expressed in SQL

Constructing DISC

Hardware: Rent from Amazon

Elastic Compute Cloud (EC2)

Generic Linux cycles for $0.10 / hour ($877 / yr)

Simple Storage Service (S3)

Network-accessible storage for $0.15 / GB / month ($1800/TB/yr)

Software: utilize open source

Hadoop Project

Open source project providing file system and MapReduce

Supported and used by Yahoo

Implementing System Software

Programming Support

Abstractions for computation & data representation

E.g., Google: MapReduce & BigTable

Usage models

Runtime Support

Allocating processing and storage

Scheduling multiple users

Implementing programming model

Error Handling

Detecting errors

Dynamic recovery

Identifying failed components


