Today I downloaded Hadoop, worked through the tutorial in its documentation, and implemented a word count example modeled on the linked tutorial.
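What I ended up with is essentially the classic WordCount. Below is a minimal sketch of it using the org.apache.hadoop.mapreduce API; the class names are illustrative, not necessarily exactly those of the linked tutorial:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

It is run with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`; the output directory must not already exist in HDFS.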
Here are some notes from the theory I read:
The storage is provided by HDFS, and analysis by MapReduce.
MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small amount of
data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
Many small splits mean better load balancing, since faster machines can process proportionally more splits over the course of the job. On the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default.
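To make the "split size ≈ block size" point concrete, the rule Hadoop's FileInputFormat applies is max(minimumSize, min(maximumSize, blockSize)). The standalone sketch below just mirrors that formula; the demo class and the literal values are mine:

```java
public class SplitSizeDemo {
  // Mirrors the split-size rule used by FileInputFormat:
  // splitSize = max(minimumSize, min(maximumSize, blockSize))
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block (the old default)
    long minSize = 1L;                  // default minimum split size
    long maxSize = Long.MAX_VALUE;      // default maximum split size

    // With the defaults, the split size equals the block size,
    // so one map task is created per HDFS block.
    System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
  }
}
```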
Reduce tasks don’t have the advantage of data locality—the input to a single reduce
task is normally the output from all mappers.
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function’s
output forms the input to the reduce function.
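For the word count job sketched above, the reducer can double as the combiner, because summing partial per-word counts on each map node and then summing those partial sums gives the same final result. Enabling it is one extra line in the driver (assuming the class names from the earlier sketch):

```java
// Pre-aggregate the (word, 1) pairs on each map node before the shuffle,
// reducing the data transferred to the reducers. This is safe here because
// integer addition is associative and commutative.
job.setCombinerClass(IntSumReducer.class);
```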
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
What this means is: with a large block, the seek time is relatively small and the cost is dominated by the transfer time; with a small block, the seek time becomes comparable to the transfer time, so the total cost is roughly twice the transfer time, which is a bad deal. Concretely, suppose there are 100 MB of data. If the block size is 100 MB, the transfer time is 1 s (at 100 MB/s) and a single 10 ms seek is negligible. If the block size is 1 MB, the transfer time is still 1 s, but the seek time is 10 ms × 100 = 1 s, so the total is about 2 s. Is bigger therefore always better? No: if blocks are too large, a file may not be stored in a distributed way across the cluster, so the MapReduce model cannot be exploited well and the job may actually run slower.
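The back-of-the-envelope numbers above (10 ms per seek, 100 MB/s transfer, 100 MB of data) can be checked with a throwaway calculation:

```java
public class BlockSizeCost {
  public static void main(String[] args) {
    double seekTimeSec = 0.010;   // 10 ms per seek
    double transferRate = 100.0;  // 100 MB/s disk transfer rate
    double fileSizeMb = 100.0;    // 100 MB of data to read

    for (double blockMb : new double[] {1, 64, 100}) {
      double seeks = Math.ceil(fileSizeMb / blockMb);    // one seek per block
      double seekTotal = seeks * seekTimeSec;            // total time spent seeking
      double transferTotal = fileSizeMb / transferRate;  // always 1 s in this setup
      System.out.printf("block=%5.0f MB  seek=%.2f s  transfer=%.2f s  total=%.2f s%n",
          blockMb, seekTotal, transferTotal, seekTotal + transferTotal);
    }
  }
}
```

With 1 MB blocks the seek time equals the transfer time (2 s total); with 100 MB blocks it is only about 1% of it, which is exactly the book's point.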