Hadoop: The Definitive Guide: Putting the Unix Tools Script into Practice

Posted on 2012-03-29 16:21 by 一酌散千憂. Category: Hadoop

Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  # print the year (file name without .gz) followed by a tab
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

This Linux shell script prints the highest temperature for each year. First, a few points about the script that are worth noting.

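The goal of the script is to find the highest temperature for each year. The all/* after "for year in" in the first statement means that the directory named all holds one file of temperature records per year, named like 1990.gz. Each file is decompressed to standard output with gunzip -c, and that output is piped to awk, which keeps track of the maximum temperature. Once a file has been processed completely, max is printed.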
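The awk step can be tried in isolation. The following is a minimal sketch, not part of the original script: max_temp and make_record are hypothetical helper names, and the records are synthetic filler lines that only place a temperature in columns 88-92 and a quality code in column 93.

```shell
#!/bin/sh
# Hypothetical helper: read fixed-width records on stdin and print the
# largest temperature that passes the same checks as the script above.
max_temp() {
  awk '{ temp = substr($0, 88, 5) + 0   # columns 88-92: signed temperature
         q = substr($0, 93, 1)          # column 93: quality code
         # skip the missing-value marker 9999 and unaccepted quality codes
         if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
       END { print max }'
}

# Hypothetical helper: build one synthetic record (87 filler zeros,
# then a 5-character temperature and a 1-character quality code).
make_record() {
  printf '%087d%s%s\n' 0 "$1" "$2"
}

{
  make_record +0022 1   # valid reading
  make_record +0317 1   # valid reading, the largest
  make_record +9999 1   # missing-value marker, ignored
  make_record +0400 2   # quality code 2 is not accepted, ignored
} | max_temp            # prints 317
```

One thing the sketch makes visible: max starts out unset, which awk treats as 0 in a numeric comparison, so a file that contains only negative readings would print an empty line rather than a negative maximum.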

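The data package obtained in the previous post is laid out like this: one folder per year, each containing a number of detailed temperature record files.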

E:\testData\1990\010010-9999-1990.gz

E:\testData\1990\010014-9999-1990.gz

E:\testData\1990\010015-9999-1990.gz

E:\testData\1990\010016-9999-1990.gz

…

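According to Appendix C later in the book, the author actually preprocessed this data. Hadoop cannot reach high efficiency when it has to process a large number of small files, so the author used Hadoop to merge the small files and provided the code for doing so.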

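I would rather handle this with a script: decompress all the gz files, merge them into a single file, and compress that file back into gz format, so that the result matches exactly what the earlier script expects. The script is as follows: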
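One way to express this idea without unpacking files in place is to stream every archive through gunzip -c and recompress the combined output. This is a sketch under assumptions, not the script used in this post: pack_year is a hypothetical function name, the files in each year directory are assumed to end in .gz, and the original directory is left untouched.

```shell
#!/bin/sh
# Hypothetical helper: merge every .gz file in a year directory into a
# single archive named <directory>.gz, leaving the directory itself intact.
pack_year() {
  yeardir=${1%/}                      # strip a trailing slash, if any
  gunzip -c "$yeardir"/*.gz | gzip -c > "$yeardir.gz"
}

# Demo on throwaway data (file names and contents are made up).
demo=$(mktemp -d)
mkdir "$demo/1990"
printf 'record a\nrecord b\n' | gzip -c > "$demo/1990/station1.gz"
printf 'record c\n' | gzip -c > "$demo/1990/station2.gz"

pack_year "$demo/1990"
gunzip -c "$demo/1990.gz"             # prints the three records in order
rm -rf "$demo"
```

Keeping the source directory means the merge can be rerun if something goes wrong; removing it afterwards is a separate, deliberate step.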

packyear

#! /bin/sh
# /usr/data/packyear
# unzip all gz files in data
for yeards in data/*
do
  # unzip all gz files in year directory
  for gzfile in "$yeards"/*
  do
    gunzip "$gzfile"
  done
  # cat all content to year file
  # Note: head -2 keeps only the first two lines of the merged data;
  # remove it to keep every record.
  cat "$yeards"/* | head -2 >> "$yeards.tc"
  # remove year directory
  rm -rf "$yeards"
  mv "$yeards.tc" "$yeards"
  # zip the tc file (produces e.g. data/1990.gz)
  gzip "$yeards"
done

The script for computing the maximum temperature, rewritten for the actual paths:

maxyear

#! /bin/sh
# /usr/data/maxyear
for year in /usr/data/*
do
  # print the year, taken from the file name without .gz
  basename "$year" .gz
  gunzip -c "$year" | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

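The final output of this script has the following format:

1990

Because I am not familiar with the structure of the data, I cannot be sure that the values shown are correct, but this is the basic approach to the scripts and the data handling.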
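For comparison, Example 2-2 prints the year and the maximum on a single tab-separated line. The sketch below shows one way to get that layout and to convert the raw value to degrees Celsius, on the assumption (taken from the book's description of the NCDC data) that temperatures are stored in tenths of a degree. format_year_max is a hypothetical helper, and 317 is an illustrative value, not the real 1990 result.

```shell
#!/bin/sh
# Hypothetical helper: print "<year><TAB><max in degrees Celsius>".
#   $1  path of the yearly archive, e.g. /usr/data/1990.gz
#   $2  raw maximum as printed by the awk step (assumed to be tenths of a degree)
format_year_max() {
  year=$(basename "$1" .gz)
  awk -v y="$year" -v m="$2" 'BEGIN { printf "%s\t%.1f\n", y, m / 10 }'
}

# Illustrative call with a made-up raw maximum of 317.
format_year_max /usr/data/1990.gz 317   # prints: 1990<TAB>31.7
```

Converting to degrees makes the result easier to sanity-check by eye, since a plausible yearly maximum is easier to recognise than a raw scaled integer.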
