    paulwong

    Analyzing Apache logs with Pig



    Analyzing log files, churning through them, and extracting meaningful information is a common use case for Hadoop. We don't have to write raw MapReduce programs for these analyses; tools like Pig and Hive handle log analysis well. This post gives you a starting point on the analysis part, using Pig to analyze Apache logs. Pig ships with built-in libraries that help load Apache log files into Pig and perform some cleanup on the raw string values in the log records. These functions live in piggybank.jar, usually found under the pig/contrib/piggybank/java/ directory. As a first step we need to register this jar with our Pig session; only then can we use its functions in our Pig Latin.
    1. Register the Piggybank jar
    REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
    Once we have registered the jar file, we need to define the functions we'll use in our Pig Latin. For any basic Apache log analysis we need a loader that loads the log files into Pig in a column-oriented format; we can define an Apache log loader as:
    2. Define a log loader
    DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
    (Piggybank has other log loaders as well.)
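    For example, if your server writes the combined log format (the common log format plus referer and user agent), Piggybank also ships a CombinedLogLoader; a minimal sketch of defining it, with an illustrative alias:
    DEFINE ApacheCombinedLogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();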
    In Apache log files the default date format is 'dd/MMM/yyyy:HH:mm:ss Z'. A timestamp at that granularity isn't much help for per-day analysis; we need the date without the time component. For that we use DateExtractor(); for example, it turns 01/Jan/2011:12:05:55 -0800 into 2011-01-01.
    3. Define a date extractor
    DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
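    Once the log file is loaded (step 4 below), you can check what DayExtractor produces by projecting it on its own; a quick sketch, where days is just an illustrative alias:
    --extract just the day portion from each record's timestamp
    days = FOREACH logs GENERATE DayExtractor(dt) AS day;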
    Once we have the required functions defined, the first step of the analysis is to load the log file into Pig.
    4. Load the Apache log file into Pig
    --load the log file from HDFS into Pig using the ApacheCommonLogLoader
    logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);
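    Before diving into the analysis it is worth sanity-checking the load; a quick look in the Grunt shell using standard Pig commands (the alias lmt is just illustrative):
    --print the schema of the loaded relation
    DESCRIBE logs;
    --peek at the first few records
    lmt = LIMIT logs 5;
    DUMP lmt;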
    Now we are ready to dive into the actual log analysis. There are many pieces of information you might want to extract from a log; we'll look at a few common requirements here.
    Note: you need to register the jar, define the classes, and load the log files into Pig before trying any of the Pig Latin below.
    Requirement 1: Find unique hits per day
    Pig Latin
    --extract the day alone and group records by day
    grpd = GROUP logs BY DayExtractor(dt);
    --for each group, count the number of distinct userIds
    cntd = FOREACH grpd
    {
        tempId = logs.userId;
        uniqueUserId = DISTINCT tempId;
        GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
    };
    --sort the records by number of unique user ids, descending
    srtd = ORDER cntd BY cnt DESC;
    --store the final result into an HDFS directory
    STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
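    Pig writes the result as part files under the target directory; assuming a standard Hadoop setup, you can inspect it from the command line with the usual HDFS shell:
    hadoop fs -cat /userdata/bejoys/pig/ApacheLogResult1/part-*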
    Requirement 2: Find unique hits per website (IP) per day
    Pig Latin
    --extract the day and group records by (day, ip address)
    grpd = GROUP logs BY (DayExtractor(dt), ip_address);
    --looping through each group to get the unique no of userIds
    cntd = FOREACH grpd
    {
                    tempId =  logs.userId;
                    uniqueUserId = DISTINCT tempId;
                    GENERATE group AS day,COUNT(uniqueUserId) AS cnt;
    }
    --sort the records by number of unique user ids, descending
    srtd = ORDER cntd BY cnt DESC;
    --store the final result into an HDFS directory
    STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
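    If you only need, say, the ten busiest (day, IP) pairs, you can trim the sorted relation before storing or dumping it; a small sketch using Pig's LIMIT:
    --keep only the top ten rows of the sorted relation
    top10 = LIMIT srtd 10;
    DUMP top10;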
    Note: when you use Pig Latin in the Grunt shell, keep a few things in mind:
    1. When you type a Pig statement in Grunt and press enter, only a semantic check is performed; no execution is triggered.
    2. The Pig statements are executed only after a STORE (or DUMP) command is submitted, i.e. the MapReduce jobs are launched only at that point.
    3. You don't have to reload the log files for every query; once loaded, the relation can be reused for all related operations in that session. Once you leave the Grunt shell, however, the loaded relations are gone, and you'd have to perform the register and load steps all over again (or run everything from a script, as sketched below).
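    If you'd rather not retype these statements in Grunt each session, you can save them into a script file and run Pig in batch mode (the file name loganalysis.pig is just an example):
    pig -f loganalysis.pig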

    posted on 2013-04-08 02:06 by paulwong (category: PIG)
