Hadoop Weekly, Issue 176
Translated by the Venustech Platform and Big Data Group
June 29, 2016
Hadoop Summit takes place in San Jose this week, so expect new project releases and good talks in the next issue (please send along any related slides). This issue has plenty on Kafka Streams, streaming data from Amazon Kinesis to Google BigQuery, and Google's dataset search system.
Technical News
Shine describes how they use AWS Lambda, Amazon Kinesis (including the Kinesis agent for Apache web servers, used to collect logs), and EC2 to move data into Google BigQuery. The post includes snippets of the Lambda function (written in JavaScript), notes on scale and cost, and an explanation of how gzip-compressing the data reduces transfer costs.
https://blog.shinetech.com/2016/06/21/kinesis-lambda-bigquery/
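The gzip step mentioned above can be illustrated with a minimal sketch. This is not Shine's code (their function is JavaScript running in Lambda); it is a standard-library Python example, and the record fields and helper names are hypothetical.

```python
import gzip
import json


def compress_batch(records):
    """Serialize records as newline-delimited JSON, then gzip the bytes.

    Returns both the raw and the compressed payload so the sizes can be compared.
    """
    raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return raw, gzip.compress(raw)


def decompress_batch(payload):
    """Reverse compress_batch: gunzip and parse one JSON document per line."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical web-server log records; the field names are illustrative only.
sample = [{"path": "/index.html", "status": 200, "bytes": 5120} for _ in range(1000)]
raw, packed = compress_batch(sample)
```

Log data is repetitive, so the compressed payload is typically far smaller than the raw one. The actual saving depends on the data and should be measured rather than assumed.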
The Cloudera blog shows how to analyze fantasy sports data with Apache Spark, Apache Impala (incubating), and Hue. The post concentrates on the analysis itself and includes Spark code and a demonstration of Hue's features.
http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-with-apache-spark-and-sql-part-2-data-exploration/
KDnuggets explains 13 key APIs, projects, and terms related to Apache Spark, including RDD, DataFrame, Dataset, Structured Streaming, GraphX, and Tungsten. Each entry gets a short description, which makes the post a good overview of Spark's main features.
http://www.kdnuggets.com/2016/06/spark-key-terms-explained.html
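This post from the Confluent blog looks at Kafka Streams applications that appear simple but are not. Its example is a program that joins user click data with user location data. The location data is stored in a KTable, an abstraction resembling a database table with a primary key (the API exposes the latest value for each key). The finished program is short, only a few lines of code.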
http://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams
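The table semantics described above (keep only the latest value per key, and enrich each stream record with whatever value is current) can be sketched without Kafka. The following is a plain-Python illustration, not the Kafka Streams API; the event layout and function name are assumptions made for the example.

```python
def join_clicks_with_regions(events):
    """Process interleaved ('location', user, region) and ('click', user, count)
    events in arrival order.

    The dict plays the role of the table: it holds only the most recent region
    per user. Each click is joined against the region that is current when the
    click arrives, then summed per region.
    """
    latest_region = {}      # table view: user -> most recent region
    clicks_per_region = {}  # running aggregate: region -> total clicks
    for kind, user, value in events:
        if kind == "location":
            latest_region[user] = value  # an update overwrites the old value
        elif kind == "click":
            region = latest_region.get(user, "UNKNOWN")
            clicks_per_region[region] = clicks_per_region.get(region, 0) + value
    return clicks_per_region


events = [
    ("location", "alice", "asia"),
    ("click", "alice", 13),
    ("location", "alice", "europe"),  # alice moves; the table now says europe
    ("click", "alice", 40),
    ("click", "bob", 4),              # no location is known for bob
]
```

In a real deployment both inputs would be partitioned topics and the table state would be distributed. The sketch only shows why the order of updates matters for the join result.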
The Cloudera blog describes the HTTP request anomaly detection system that meinstadt.de built on Apache Flume, Apache Spark Streaming, and Apache Impala (incubating). The implementation is available on GitHub.
http://blog.cloudera.com/blog/2016/06/how-to-detect-and-report-web-traffic-anomalies-in-near-real-time/
The AWS Big Data blog has a tutorial on processing Amazon Kinesis data with Apache Spark and Apache Zeppelin on an Amazon EMR cluster. It includes sample visualizations produced by running SQL in a Zeppelin notebook.
http://blogs.aws.amazon.com/bigdata/post/Tx3K805CZ8WFBRP/Analyze-Realtime-Data-from-Amazon-Kinesis-Streams-Using-Zeppelin-and-Spark-Strea
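Apache Kudu (incubating) is approaching its 1.0 release, which will fully support high availability. This post explains how the last piece of the puzzle, multi-master support, is being implemented. It reviews the state of the related JIRA issues and the testing that is done and still outstanding.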
http://kudu.apache.org/2016/06/24/multi-master-1-0-0.html
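Google's data platforms hold more than 26 billion datasets, and about 1.6 billion dataset paths are added and removed every day. To track, query, and compare datasets, Google built Google Dataset Search (GOODS). GOODS tracks metadata and exposes it through an API; the metadata is used for search, monitoring, and more.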
http://dl.acm.org/citation.cfm?id=2903730
Other News
SiliconAngle interviewed Hortonworks CEO Rob Bearden. Topics include industry trends, Hortonworks' finances, Hortonworks' non-Hadoop technologies, and the Internet of Things.
http://siliconangle.com/blog/2016/06/24/hadoop-and-beyond-a-conversation-with-hortonworks-ceo-rob-bearden/
Releases
Apache Sentry released version 1.7.0 this week, with bug fixes, new features, and other improvements. The release upgrades the Hive authorization framework to V2.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201606.mbox/%3CCAPOmu3sDqdzu9ntDSvkMaDRQnVfHrkGV5qhyh-ZRiMmwgMMvBA@mail.gmail.com%3E
DataStax Enterprise 5.0, built on Apache Cassandra 3.0, adds support for graph data, tiered storage, and multiple Cassandra instances. The release also adds security features such as encryption and role-based access control.
https://www.datastax.com/2016/06/introducing-datastax-enterprise-5-0
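Driven, the performance monitoring system for big data applications, released version 2.2. The highlight of the release is support for monitoring Apache Spark.
BlueData released its EPIC enterprise Big-Data-as-a-Service product for Amazon Web Services. With a few clicks, the product automatically provisions Docker-based Hadoop clusters.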
http://www.bluedata.com/blog/2016/06/big-data-as-a-service-on-prem-or-cloud-bdaas/
Apache Accumulo released version 1.7.2. The release fixes write-ahead log handling, optimizes RFiles, and improves performance.
https://accumulo.apache.org/release_notes/1.7.2.html
Apache Curator, a high-level SDK for Apache ZooKeeper, released versions 2.11.0 and 3.2.0.
https://cwiki.apache.org/confluence/display/CURATOR/Releases#Releases-June23,2016,Releases2.11.0and3.2.0available
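Apache Hive released version 2.1.0, with a large number of bug fixes and enhancements, including improvements to Hive LLAP (Live Long and Process) and to JDBC support.
Events
China
July 2: Shanghai BigData Streaming, third meetup
Hadoop Weekly, Issue 175
Translated by the Venustech Platform and Big Data Group
June 19, 2016
Hadoop Summit is about a week away, and several products and projects have already timed their releases for it. The Technical News section covers Hadoop Kerberos authentication and Salsify's use of Avro. The Releases section has new versions of several projects, including the columnar database that Yandex recently open sourced.
Technical News
The OpenCore blog demonstrates several tools for debugging Kerberos authentication in Hadoop. In particular, it shows how to use the "main()" method of UserGroupInformation to print useful debugging information.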
http://www.opencore.com/blog/2016/5/user-name-handling-in-hadoop/
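In the fourth part of its YARN series, the Cloudera blog explains how to configure Fair Scheduler queues. It covers resource constraints, queue placement policies, and preemption in detail.
Salsify built an asynchronous microservice architecture on Apache Kafka, with Apache Avro for data serialization. The application is written in Ruby, so the team created several new tools to make Avro work well with Ruby. The post describes these tools and what each is for: avro-builder for defining records, a Postgres-backed schema registry, and avromatic for generating models from Avro schemas.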
http://blog.salsify.com/engineering/adventures-in-avro
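Apache Drill can infer schemas dynamically and supports data with multiple (but mutually compatible) schemas. The combination enables some interesting use cases, such as querying across JSON files with different schemas. The MapR blog explores and demonstrates these features.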
https://www.mapr.com/blog/sql-query-mixed-schema-data-using-apache-drill
This tutorial shows how to combine Druid and Apache Kafka to build a streaming analytics and visualization application, using Pivot, a web UI for Druid.
http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid
The Apache Beam (incubating) blog reports on the project's progress in running batch pipelines on Apache Flink. Beam is an open source SDK, originally from Google, that exposes a backend-agnostic API for data pipelines.
http://beam.incubator.apache.org/blog/2016/06/13/flink-batch-runner-milestone.html
Cask Hydrator is a tool for building data pipelines through a drag-and-drop UI. This tutorial shows how to use Hydrator to import data from MySQL into HDFS.
http://blog.cask.co/2016/06/bringing-relational-data-into-data-lakes/
Databricks describes the new SQL subquery support in the upcoming Apache Spark 2.0. Notably, the post is presented as a notebook, so it shows code and sample data directly.
https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
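For readers unfamiliar with subqueries, here is a small illustration of two common forms. It uses SQLite through Python's standard library, not Spark, and the tables and data are invented for the example. The summary above does not say which subquery forms Spark 2.0 covers, so treat this as generic SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employee VALUES
        ('ann', 'eng', 120), ('bob', 'eng', 100),
        ('cat', 'ops', 90),  ('dan', 'ops', 70);
    CREATE TABLE on_call (dept TEXT);
    INSERT INTO on_call VALUES ('ops');
""")

# Scalar subquery: the inner query yields one value (the average salary, 95)
# and every row of the outer query is compared against it.
above_average = conn.execute(
    "SELECT name FROM employee "
    "WHERE salary > (SELECT AVG(salary) FROM employee) ORDER BY name"
).fetchall()

# Predicate subquery: keep rows whose dept appears in another table.
in_on_call_dept = conn.execute(
    "SELECT name FROM employee "
    "WHERE dept IN (SELECT dept FROM on_call) ORDER BY name"
).fetchall()
```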
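The Apache Kudu (incubating) blog describes using Raft even on single-node configurations, which makes it possible to grow dynamically into a multi-master cluster.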
http://getkudu.io/2016/06/17/raft-consensus-single-node.html
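Other News
This article argues that unless the Apache Spark community is careful, it may repeat the fragmentation that troubled the Apache Hadoop ecosystem. For example, the latest versions of CDH and HDP support different versions of Spark.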
https://techcrunch.com/2016/06/12/spark-fragmentation-undermines-community/
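The New Stack writes about Concord, a new stream processing framework built on Apache Mesos (currently in public beta). Concord is written in C++ and supports dynamic topologies: parts of a pipeline can be added and removed without downtime.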
http://thenewstack.io/concord-leverages-mesos-high-performance-stream-processing/
With Databricks Community Edition now generally available, Databricks published the first post in a tutorial series on writing Apache Spark applications on Databricks.
https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html
Hadoop Summit San Jose, a couple of weeks away, includes a "Women in Big Data" luncheon. The Hortonworks blog interviews the luncheon's host, Hortonworks CMO Ingrid Burton.
http://hortonworks.com/blog/summer-hortonworks-part-2-wibd-assertive-innovative-take-risks/
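Releases
Apache SystemML (incubating) recently released version 0.10.0. SystemML is a machine learning framework that runs on several backends, including Apache Spark and Apache Hadoop. The release includes a new Spark Matrix Block type, deep learning support, performance improvements, a new KNN algorithm, and more.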
http://systemml.apache.org/0.10.0-incubating/release_notes.html
Apache Mahout, another machine learning framework, released version 0.12.2. The release is a step toward integrating with Apache Zeppelin for visualization and notebook support.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201606.mbox/%3CCAOtpBjgBAuQs5FiX5X_5A+Rd-A1fVz0R7SKttGe4cJuCLRiGww@mail.gmail.com%3E
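Qubole announced that its HBase-as-a-Service is generally available on AWS. It offers a number of useful features for long-running clusters: support for Hannibal and other monitoring tools, Apache Zeppelin integration, and the ability to set up OpenTSDB and Apache Phoenix through node bootstrap scripts.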
https://www.qubole.com/blog/product/quboles-hbase-as-a-service-is-generally-available-on-aws/
Altiscale released the Real-Time Edition of Altiscale Insight Cloud. The system is powered by Apache HBase and Spark Streaming.
https://www.altiscale.com/blog/announcing-the-altiscale-insight-cloud-real-time-edition/
`hs2client` is a new C++ library for Apache Hive and Apache Impala (incubating). Besides C++, the library has Python bindings that can read data into a pandas DataFrame.
MapR now supports the Apache Spark 2.0 developer preview in its distribution.
https://www.mapr.com/blog/spark-20-now-developer-preview-mode-mapr-platform
Apache Beam released version 0.1.0-incubating, its first release since joining the Apache Incubator.
http://beam.incubator.apache.org/beam/release/2016/06/15/first-release.html
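Yandex open sourced ClickHouse, a columnar analytics database. The system is built to scale both horizontally and vertically, and it supports complex data types (such as arrays) and approximate queries. The team also published benchmark results comparing it with other databases.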
https://clickhouse.yandex/
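Events
China
Hadoop Weekly, Issue 174
Translated by the Venustech Platform and Big Data Group
June 12, 2016
Spark Summit took place in San Francisco this week, and as expected this issue has plenty of Apache Spark news, announcements, and releases. Beyond Spark, there are articles on Kafka, Cask, and Ambari. The Releases section includes the first Apache Pig release in a year, Runway (a new tool for designing distributed systems), and a new version of Apache Kudu (incubating).
Technical News
Debezium is a relatively new project for change data capture from databases into Apache Kafka topics. It currently works with MySQL, ZooKeeper, and Kafka; this tutorial shows how to set up ZooKeeper, Kafka, and MySQL in Docker containers on Kubernetes.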
http://debezium.io/blog/2016/05/31/Debezium-on-Kubernetes/
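Some people were surprised when the Apache Kafka project announced yet another stream processing engine, Kafka Streams. Kafka Streams differs from other systems in several key ways, and this post does a good job of showing them: its abstractions, its deployment model, and its support for stateful computation.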
https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
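Anyone who uses MapReduce, Spark, or a similar system has run into bugs that are hard to debug and specific to the data. BigDebug is a research project and paper from UCLA (University of California, Los Angeles) that gives developers tools for tracking such problems down, for example finding the input that caused a crash, along with tracing, breakpoints, watchpoints, and latency alerts. The tool is built on Apache Spark 1.2.1.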
https://blog.acolyer.org/2016/06/07/bigdebug-debugging-primitives-for-interactive-big-data-processing-in-spark/
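Cask writes about running Spark in the open source Cask Data Application Platform (CDAP). Spark programs running in CDAP get fine-grained transaction support through Apache Tephra (incubating), which makes it easy to use snapshot isolation to copy one table to another consistently. Spark in CDAP can also use Cask Tracker, which provides data lineage information (when data was created, used, and so on). Depending on the application, other CDAP tools may add further value.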
http://blog.cask.co/2016/06/cdap-spark-prototype-to-production/
The IBM Hadoop Dev blog has a tutorial on calling the Ambari REST API with cURL. It also shows how to establish a session on both vanilla and Kerberos-enabled clusters and how to reuse that session for later requests.
https://developer.ibm.com/hadoop/2016/06/07/ambari-rest-calls-for-kerberos-enabled-clusters/
The Google Cloud Platform blog explains how to debug Apache Beam (incubating) jobs running on Google Dataflow. For tracking down performance bottlenecks, Dataflow offers useful statistics and a UI that lets users drill into each step.
https://cloud.google.com/blog/big-data/2016/06/understanding-timing-in-cloud-dataflow-pipelines
Other News
The Transaction Processing Performance Council (TPC) released the TPCx-BB benchmark, which is designed for big data systems. Besides SQL, it tests machine learning workloads such as clustering and classification.
http://www.datanami.com/2016/06/01/big-data-benchmark-gauges-hadoop-platforms/
Strata + Hadoop World London took place two weeks ago. Speakers' presentations and slides have been posted on the conference website.
http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/proceedings
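Splice Machine, which builds an RDBMS on Hadoop, announced that it is open sourcing its software. The company is currently looking for contributors, mentors, and champions to help the open source effort succeed. Splice Machine has a number of interesting features, such as ACID transactions, secondary indexes, and referential integrity.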
http://www.splicemachine.com/were_going_open_source/
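The Altiscale blog rounds up articles on big data use cases in customer service, sentiment analysis, climate change, smart cities, bias, and more. It also collects some articles by big data skeptics.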
https://www.altiscale.com/blog/big-data-news-health-and-public-safety-sentiment-analysis-fixing-education-2/
Spark Summit took place in San Francisco this week. Databricks, the organizer, recaps the highlights of the two days and links to many of the talks and presentations.
https://databricks.com/blog/2016/06/08/another-record-setting-spark-summit.html
Qubole, a Big-Data-as-a-Service (BDaaS) company, writes about how its customers are adopting Spark. Adoption has been fast: more than half of its customers now use Spark. Qubole also supports Presto and has seen similar growth there.
https://www.qubole.com/blog/big-data/spark-usage/
Twitter submitted DistributedLog, its replicated log service, to the Apache Incubator.
https://wiki.apache.org/incubator/DistributedLogProposal
Big Data Day LA: June 9 at West Los Angeles College. The event is free (with advance registration), with speakers from Confluent, Databricks, Yahoo, Netflix, and others.
http://www.bigdatadayla.com/
Releases
Apache Spark released a preview of Spark 2.0. The announcement notes that neither the APIs nor the functionality are final.
https://spark.apache.org/news/spark-2.0.0-preview.html
JustOne built and open sourced a Kafka-to-PostgreSQL connector. This post covers the connector's performance, explains in detail how messages are converted to rows, and describes how to configure it.
http://www.confluent.io/blog/kafka-connect-sink-for-postgresql-from-justone-database
Salesforce open sourced Runway, a system for modeling, simulating, and visualizing distributed systems. An online demo at runway.system shows a "too many bananas" model, an elevator system, and the Raft consensus protocol.
https://medium.com/salesforce-open-source/runway-intro-dc0d9578e248
Bloomberg recently open sourced Presto Accumulo, a Presto connector for Apache Accumulo. The announcement links to an 11-page paper that compares benchmark results for Presto-based queries against queries written with the Accumulo Java API.
http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
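Microsoft Azure announced the general availability of Apache Spark for Azure HDInsight, based on Apache Spark 1.6.1. The release supports the Project Livy REST job service for Spark, integrates with Azure Data Lake Store (with role-based access control) and with IntelliJ, supports Jupyter notebooks, and more.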
https://azure.microsoft.com/en-us/blog/apache-spark-for-azure-hdinsight-now-generally-available/
LinkedIn open sourced Photon ML, its library for large-scale regression. Photon is built on Spark and runs on YARN at LinkedIn (it was previously based on MapReduce and appears to have been migrated for performance).
https://engineering.linkedin.com/blog/2016/06/open-sourcing-photon-ml
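Hortonworks released a technical preview of its Spark-HBase connector. The preview has native Avro support, works with secure clusters, natively supports the Spark Datasource API, and optimizes partition pruning, column pruning, and predicate pushdown.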
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
Databricks released the first phase of security features for its Apache Spark platform. This phase adds support for ACLs, SAML 2.0, and end-to-end audit logs.
https://databricks.com/blog/2016/06/08/achieving-end-to-end-security-for-apache-spark-with-databricks.html
Apache ORC 1.1.0 was released. The release completes the migration of the Java code out of Apache Hive, fixes timestamp handling in the C++ code, and adds a Hadoop MapReduce connector.
http://orc.apache.org/news/2016/06/10/ORC-1.1.0/
Apache Kudu released version 0.9.0. It adds an UPSERT command and a new Spark data source that does not depend on the MapReduce API, and it improves Tablet Server write performance.
http://getkudu.io/2016/06/10/apache-kudu-0-9-0-released.html
The Google Cloud Platform team released a version of Google Cloud Dataproc that supports the Spark 2.0 preview.
https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview
Dory (the successor to Bruce), a daemon for Kafka producers, can now receive data over UNIX domain sockets or local TCP.
http://mail-archives.apache.org/mod_mbox/kafka-users/201606.mbox/%3C1465683894.608424023@apps.rackspace.com%3E
Apache Pig 0.16.0 is out, the first release in a year. It strengthens support for Tez.
http://pig.apache.org/releases.html#8+June%2C+2016%3A+release+0.16.0+available
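Events
China
Spark Meetup (Shanghai), Saturday, June 18
Hadoop Weekly, Issue 173
Translated by the Venustech Platform and Big Data Group
June 5, 2016
A light week, with a small amount of content on Spark, NiFi, Netflix Meson, and Storm. Spark Summit is in San Francisco this coming week, so there should be plenty to cover next week.
Technical News
The Databricks blog introduces a new feature in Apache Spark 2.0: saving and loading machine learning models across languages. Models are saved and loaded through a simple API; model metadata and parameters are stored as JSON, and model data is stored as Parquet.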
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
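Meson is the framework Netflix uses to run machine learning workflows. It acts as the glue between big data technologies such as Apache Hive, Spark, and Mesos. Workflows are written in a DSL, and Meson also provides a fairly advanced UI for visualizing pipelines. Netflix has not open sourced Meson yet but plans to.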
http://techblog.netflix.com/2016/05/meson_31.html
The IBM Hadoop Dev blog briefly introduces and demonstrates the archival storage capability of HDFS.
https://developer.ibm.com/hadoop/2016/06/01/use-hdfs-archival-storage/
Apache Storm 1.0 has some impressive new features. This post focuses on several debugging improvements: dynamic log levels, unified log search, event sampling, and integrated jstack, heap dump, and Java Flight Recorder profiling of workers.
http://hortonworks.com/blog/whats-new-apache-storm-1-0-part-1-enhanced-debugging/
The Cloudera blog shows how to use Apache Spark for exploratory analysis of historical NBA statistics stored in CSV files. The analysis combines Scala and SQL.
http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-using-apache-spark-and-sql/
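Apache NiFi has attracted a lot of attention as a general-purpose tool. It was built for "flow-based processing", a phrase that may not mean much to many people, but NiFi supports standard ETL, stream processing, and more. Many NiFi examples show how to move data from the Twitter firehose into HDFS; this post focuses on other NiFi features and walks through some simple flows that pull data over HTTP.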
http://hortonworks.com/blog/apache-nifi-not-scratch/
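Amazon Redshift is built on the PostgreSQL engine, so PostgreSQL extensions can be used to connect a PostgreSQL instance to a Redshift cluster. That enables interesting applications such as cross-database joins, converting Redshift results to JSON, creating views of Redshift data in Postgres, and copying data between the databases.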
http://blogs.aws.amazon.com/bigdata/post/Tx1GQ6WLEWVJ1OX/JOIN-Amazon-Redshift-AND-Amazon-RDS-PostgreSQL-WITH-dblink
Other News
FeatherCast published more than 100 audio recordings from ApacheCon North America.
http://feathercast.apache.org/tag/apacheconna2016/
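InfoWorld covers Heron, the Apache Storm-compatible project that Twitter just open sourced. The article describes the architectural differences between the two projects. It also points out that Heron predates the Storm release of a few months ago, so Storm may now have the edge on features.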
http://www.infoworld.com/article/3078134/analytics/had-it-with-apache-storm-heron-swoops-to-the-rescue.html
Databricks launched a new course on edX, "Introduction to Apache Spark". The course starts June 15 and runs for two weeks.
launch-first-of-five-free-big-data-courses-on-apache-spark.html
Releases
Amazon EMR released version 4.7.0. The release adds support for Apache Tez and Apache Phoenix and ships new versions of Apache HBase, Apache Mahout, and Presto. The AWS Big Data blog also has a guide to getting started with Phoenix.
http://aws.amazon.com/blogs/aws/amazon-emr-4-7-0-apache-tez-phoenix-updates-to-existing-apps/
Apache Hive released version 2.0.1 this week, the first point release since 2.0.0 came out in February. It fixes 60 bugs.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CD37344A3.77A64%25sershe@apache.org%3E
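Events
China
None
Hadoop Weekly, Issue 172
Translated by the Venustech Platform and Big Data Group
May 22, 2016
This week's focus is stream processing: Twitter and Cloudera introduced new streaming frameworks, there is an article on streaming SQL in Apache Flink, DataTorrent describes fault tolerance in Apache Apex, Concord is another new streaming framework, and Apache Kafka 0.10 was released. In other news, the Apache Incubator was busy: Apache TinkerPop and Apache Zeppelin graduated to top-level projects, and Tephra entered the incubator. There are also new articles on Apache Spark, Apache HBase, Apache Drill, Apache Ambari, and more.
Technical News
The DataTorrent blog describes how Apache Apex provides fault tolerance when reading and writing files. Apex is designed for stream processing, which comes with subtle but important details. For example, when writing to HDFS, the HDFS lease mechanism can cause problems.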
https://www.datatorrent.com/blog/fault-tolerant-file-processing/
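The Databricks blog describes the performance gains from the Tungsten code generation engine in Spark 2.0. The post uses examples to show that the generated code is faster because it avoids virtual function calls, makes better use of CPU registers, and unrolls loops. In addition to the Databricks post, The Morning Paper discusses the VLDB paper that inspired these techniques.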
https://blog.acolyer.org/2016/05/23/efficiently-compiling-efficient-query-plans-for-modern-hardware/
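StreamScope, Microsoft's stream processing system, is the subject of another streaming post from The Morning Paper this week. It describes the system's characteristics: throughput and cluster size, the programming model (SQL), the time model, semantics and guarantees, and how it is used in Microsoft products.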
https://blog.acolyer.org/2016/05/24/streamscope-continuous-reliable-distributed-processing-of-big-data-streams/
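The Apache blog describes the HubSpot team's experience tuning G1GC for Apache HBase. The post covers how HubSpot tested and ensured stability, how they maintained 99th percentile performance, and how they reduced the time spent in garbage collection. The team used many techniques to work through the intricacies of the GC algorithm. The post ends with a step-by-step guide to tuning G1GC for HBase.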
https://blogs.apache.org/hbase/entry/tuning_g1gc_for_your_hbase
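LinkedIn describes the difficulties of debugging offset management problems in Kafka. The post focuses on two so-called "offset rewind" incidents: their symptoms, how such events can be detected in monitoring, and the root causes (and fixes) for both.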
https://engineering.linkedin.com/blog/2016/05/kafkaesque-days-at-linkedin--part-1
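One simple way to detect a rewind in monitoring is to watch for a committed offset that is lower than the previous commit for the same consumer group and partition. The sketch below illustrates that idea only. It is an assumption about how such a check might look, not the method LinkedIn describes, and the tuple layout and function name are hypothetical.

```python
def find_offset_rewinds(commits):
    """Scan (group, partition, offset) commits in arrival order.

    Returns a list of (group, partition, previous_offset, new_offset) for
    every commit that moves backwards relative to the previous commit of
    the same group and partition.
    """
    last_seen = {}
    rewinds = []
    for group, partition, offset in commits:
        key = (group, partition)
        previous = last_seen.get(key)
        if previous is not None and offset < previous:
            rewinds.append((group, partition, previous, offset))
        last_seen[key] = offset
    return rewinds


# Hypothetical commit history: partition 0 jumps back from 150 to 120.
history = [
    ("billing", 0, 100),
    ("billing", 0, 150),
    ("billing", 1, 40),
    ("billing", 0, 120),
    ("billing", 1, 45),
]
```

A production check would also need to tolerate deliberate resets, for example an operator replaying a topic on purpose.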
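The Databricks blog published the third and final part of its series on genome variant analysis with Apache Spark. The post goes from preparation (converting files to Parquet and loading them into a Spark RDD), to loading the genotype data, to running k-means clustering to predict geographic population from genotype features.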
https://databricks.com/blog/2016/05/24/predicting-geographic-population-using-genome-variants-and-k-means.html
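Many batch-oriented big data systems have moved from custom APIs back to SQL, so it will be interesting to see whether stream processing frameworks go the same way. In this post, the Apache Flink team describes its plans for streaming SQL. Flink already has a Table API, and the team uses Apache Calcite to provide SQL support. For windowing, they plan to use Calcite's streaming SQL extensions. Initial SQL support will arrive in version 1.1.0 and be extended in 1.2.0.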
http://flink.apache.org/news/2016/05/24/stream-sql.html
This post introduces an XML plugin for Apache Drill. Although it is not yet bundled with Drill, it is fairly easy to compile it into a jar and configure XML support.
https://www.mapr.com/blog/how-use-xml-plugin-apache-drill
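The Hortonworks blog gives a brief overview of the architecture of the Ambari Metrics System, which recently gained Grafana as its dashboard front end. The system uses Apache Phoenix and Apache HBase for storage, so it scales horizontally.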
http://hortonworks.com/blog/hood-ambari-metrics-grafana/
This tutorial shows how to use Spark SQL on Amazon EMR, together with Hue and Apache Zeppelin, to run SQL queries over tab-delimited data stored in S3. It ends by showing how to store data in DynamoDB from Spark.
http://blogs.aws.amazon.com/bigdata/post/Tx2D93GZRHU3TES/Using-Spark-SQL-for-ETL
The Heroku team shares its experience with the latest version of Apache Kafka: the newly introduced timestamp field (8 bytes) leads to some counterintuitive changes in performance.
https://engineering.heroku.com/blogs/2016-05-27-apache-kafka-010-evaluating-performance-in-distributed-systems/
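Other News
The O'Reilly Data Show podcast interviewed Michael Armbrust of Databricks about Structured Streaming in Spark 2.0. An accompanying article on the site quotes excerpts on Spark SQL, the goals of Structured Streaming, end-to-end pipeline guarantees, and using Spark's machine learning algorithms for online processing.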
https://www.oreilly.com/ideas/structured-streaming-comes-to-apache-spark-2-0
Two big data projects graduated from the Apache Incubator this week: Apache TinkerPop and Apache Zeppelin. TinkerPop is a graph computing framework, and Zeppelin is a web-based notebook for data analysis.
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces91
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces92
Tephra, a transaction engine for HBase, entered the Apache Incubator. Tephra was originally created by the team at Cask and has so far been integrated only with Apache Phoenix.
http://blog.cask.co/2016/05/tephra-a-transaction-engine-for-hbase-moves-to-apache-incubation/
TechRepublic writes about Concord.io, a stream processing framework written in C++. It aims to fill a gap in the market for high-performance stream processing.
http://www.techrepublic.com/article/could-concord-topple-apache-spark-from-its-big-data-throne/
Releases
Apache Avro released version 1.8.1 this week. It fixes more than 20 bugs and includes other improvements.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CCAO4re1nYMm79WQ2LUeODWjHmJ9EiYOF=mty6p2aiq-S_4R95iQ@mail.gmail.com%3E
Confluent released a Kafka Python client built on librdkafka.
https://pypi.python.org/pypi/confluent-kafka/0.9.1.1
Apache Kafka 0.10 was released, bringing Kafka's new stream processing support. The new version adds rack awareness and timestamps in messages, and it improves SASL, Kafka Connect, and more.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CCAPuboUuRyCRxDp5CLjv2yVM77SpYFF+HdnBeiiyeumYTJNpY4g@mail.gmail.com%3E
Confluent released Confluent Platform 3.0, based on Apache Kafka 0.10. In addition to Kafka's core features, Confluent Platform has a commercial component that provides configuration tools and end-to-end monitoring for Kafka Connect.
http://www.confluent.io/blog/announcing-apache-kafka-0.10-and-confluent-platform-3.0
Apache Kylin, an OLAP engine for big data, released version 1.5.2. Although it is a patch release, 1.5.2 has a good number of new features, improvements, and bug fixes, including support for CDH 5.7 and MapR.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CCA+LQBaTDxb4wVYVvtOC22gMbJ0p9cvhAWzEY_x2n1oNGvEDPSQ@mail.gmail.com%3E
Twitter open sourced Heron, its stream processing system. Heron is Twitter's replacement for Apache Storm and focuses on performance, debugging, and developer productivity.
https://blog.twitter.com/2016/open-sourcing-twitter-heron
Envelope is a new project from Cloudera Labs that provides configuration-driven streaming ETL. It is built on Spark Streaming, and connectors for Kafka and Kudu are the current focus of development.
http://blog.cloudera.com/blog/2016/05/new-in-cloudera-labs-envelope-for-apache-spark-streaming/
Events
China
Spark Meetup 4 (Hangzhou), Sunday, June 5
http://www.meetup.com/Hangzhou-Apache-Spark-Meetup/events/231071384/
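Hadoop Weekly, Issue 171
Translated by the Venustech Platform and Big Data Group
May 22, 2016
Several projects had releases this week, including a newly open sourced project from LinkedIn. In the news sections, a number of articles look back at the Apache: Big Data North America conference, and there is a series of posts that analyze New York City taxi data across several different data systems.
Technical News
The Databricks blog examines two approximate algorithms in Apache Spark. The first, "approxCountDistinct", estimates the number of distinct values; the second, "approxQuantile", produces approximate percentiles. The post describes the algorithms and visualizes the residual error at different precision settings.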
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
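To give a feel for how an approximate distinct count can work, here is a compact HyperLogLog sketch in plain Python. It is a textbook-style illustration, not Spark's implementation; the class name, the SHA-1 hash, and the default precision are choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (p >= 7 assumed)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # use 64 bits of the hash
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%, while memory use stays fixed no matter how many items are added. Duplicates never change a register, which is why the estimate tracks distinct values.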
This tutorial describes how to store, index, and query medical images in DICOM format using Apache Hadoop HDFS, Apache Solr, and Hue. It walks through every step of loading and retrieving the data.
http://blog.cloudera.com/blog/2016/05/how-to-process-and-index-medical-images-with-apache-hadoop-and-apache-solr/
MapR Streams is a system whose API is compatible with Apache Kafka. This post compares MapR Streams and Kafka at a high level, and it also explains how Kafka Streams relates to MapR Streams.
https://www.mapr.com/blog/apache-kafka-and-mapr-streams-terms-techniques-and-new-designs
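In my view, this is one of the clearest introductions to Paxos, a consensus protocol for distributed systems. The post demonstrates the protocol with drawings of computers and a distributed auction.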
http://ifeanyi.co/posts/understanding-consensus/
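Based on a talk at Apache: Big Data North America, Datanami previews the new features of the upcoming Apache Hadoop 3. They include a rewrite of the shell scripts, task-level native optimization, automatic tuning of memory sizes, and support for HDFS erasure coding. The article concentrates on erasure coding and on the storage efficiency it brings (disk consumption drops from 3x to 1.5x).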
http://www.datanami.com/2016/05/18/hadoop-3-poised-boost-storage-capacity-resilience-erasure-coding/
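The 3x and 1.5x figures follow from simple arithmetic. The sketch below assumes a Reed-Solomon layout with six data cells and three parity cells, which the summary does not state; it is used here only because it yields the 1.5x figure quoted above.

```python
def storage_overhead(data_units, redundancy_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + redundancy_units) / data_units


# Triple replication: one original block plus two extra copies.
replication = storage_overhead(1, 2)

# Assumed Reed-Solomon (6 data, 3 parity) layout: nine cells hold six cells of data.
reed_solomon = storage_overhead(6, 3)
```

Under these assumptions replication costs 3.0 bytes per byte stored and the erasure-coded layout costs 1.5, while the erasure-coded stripe can still be rebuilt after losing any three of its nine cells.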
This talk from PyData Berlin describes Apache Arrow and the Feather file format, and it explores how data interoperability across languages and frameworks works.
http://www.slideshare.net/wesm/python-data-ecosystem-thoughts-on-building-for-the-future
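Videos of two Apache Kafka talks from different conferences have been published. The first discusses Kafka's security features, and the second explores how Kafka can be used to share data across systems.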
https://www.oreilly.com/learning/securing-apache-kafka
https://www.infoq.com/presentations/event-streams-kafka
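This blog brings together several posts on loading and querying New York City taxi data with Amazon Redshift, Google BigQuery, Postgres, Presto, and other data systems. Beyond the raw benchmarks, the posts explain in detail how to handle failures, how to optimize, and how the alternatives compare (for example S3 versus HDFS on AWS).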
http://tech.marksblogg.com/all-billion-nyc-taxi-rides-redshift.html
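O'Reilly describes how to implement a kappa architecture with Kafka, Flink, Elasticsearch, and Kibana. The article gives an overview of the lambda and kappa architectures, introduces the main components, and shows how to set up a Bayesian model for novelty detection.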
http://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
Other News
This article lists several big data ecosystem projects that came up at the recent Apache: Big Data North America conference. Quite a few of them had not been on our radar.
http://www.datanami.com/2016/05/11/open-source-tour-de-force-apache-big-data-2016/
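The Pivotal blog has an interesting post on big data and agile development. Big data systems often remain in a non-agile world, where requirements have to be gathered and models defined before any data is loaded. The post argues that outside the cloud this approach runs into constraints (limited capacity and performance, siloed data, and so on).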
https://blog.pivotal.io/big-data-pivotal/features/when-it-comes-to-big-data-cloud-and-agility-go-hand-in-hand
Databricks published the recording of its webinar "Apache Spark MLlib: From Quick Start to Scikit-Learn". Along with the video, the post answers eight common questions from the session.
https://databricks.com/blog/2016/05/18/spark-mllib-from-quick-start-to-scikit-learn.html
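The Hortonworks blog looks back at the history of Apache Storm: open sourced in 2011, entered the Apache Incubator in 2013, became a top-level project in 2014, and released version 1.0 early this year. The post describes the main technical advances at each milestone.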
http://hortonworks.com/blog/brief-history-apache-storm/
HBaseCon takes place in San Francisco this week, with presentations from Apple, Yahoo, and Facebook.
MapR published an infographic celebrating what Apache Drill has achieved over the past year: seven releases and several milestones.
https://www.mapr.com/blog/happy-anniversary-apache-drill-what-difference-year-makes
Datanami published a Q&A from Apache: Big Data North America with ASF director Jim Jagielski and ODPi program director John Mertic. As expected, the main topic is the relationship between the ASF and ODPi.
http://www.datanami.com/2016/05/20/apache-foundation-keeps-eyes-wide-open-odpi/
Releases
LinkedIn open sourced Ambry, its distributed object store. The code is on GitHub, and this post describes Ambry's guarantees, design goals, architecture, and interfaces.
https://engineering.linkedin.com/blog/2016/05/introducing-and-open-sourcing-ambry---linkedins-new-distributed-
Pivotal HDB, powered by Apache HAWQ (incubating), released version 2.0 this week. HDB provides an analytic database for Hadoop.
https://blog.pivotal.io/big-data-pivotal/products/fail-fast-and-ask-more-questions-of-your-data-with-hdb-2-0
Apache Mahout, a machine learning and data mining system, released version 0.12.1 this week. The release advances the integration of Flink with Mahout.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CCAOtpBjhshagyLN3Qnt0xRnc7YbnMVJjTS4piVXL7LiS2pQguXw@mail.gmail.com%3E
Apache Tajo released version 0.11.3. Tajo is a data warehouse for Hadoop. The release fixes 5 bugs.
http://tajo.apache.org/releases/0.11.3/announcement.html
MongoDB released a new MongoDB Connector for Apache Spark. Compared with using the Hadoop InputFormat shim from Spark, the connector offers additional features. The post also explains some of MongoDB's key features.
http://rosslawley.co.uk/introducing-a-new=mongodb-spark-connector/
Syncsort released DMX-h v9, with support for Kafka and a new execution framework.
http://insidebigdata.com/2016/05/20/syncsorts-latest-innovations-simplify-integration-of-streaming-data-in-spark-kafka-and-hadoop-for-real-time-analytics/
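Events
China
None
Hadoop Weekly, Issue 169
Translated by the Venustech Platform and Big Data Group
May 8, 2016
A short but solid issue this week. Topics include Apache Beam, MapR's quarterly results, the recent Kafka Summit, and a newly open sourced distributed unit testing framework from Cloudera.
Technical News
Elastic analyzes the root cause of an outage. Misconfigured ZooKeeper memory settings caused excessive garbage collection, which ultimately led to the loss of the ZooKeeper quorum. The post describes mitigations intended to prevent similar problems in the future.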
https://www.elastic.co/blog/elastic-cloud-outage-april-2016
The Cask blog briefly recaps the recent Big Data Applications Meetup. First up was Pachyderm, which provides "Git for data" semantics on top of Docker containers. Second was TubeMogul's big data platform, which is built on Hadoop, Hive, Spark, and Presto.
http://blog.cask.co/2016/05/pachyderm-and-tubemogul-share-their-big-data-application-platforms-and-experience/
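Google and dataArtisans both wrote about Apache Beam (formerly the Google Dataflow SDK). Google's post explains the motivation for open sourcing Beam and developing it in the open; dataArtisans' post describes their support for the Beam model and how they see the relationship between Flink and the Beam API.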
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
http://data-artisans.com/why-apache-beam/
The IBM Hadoop Dev blog has instructions for installing Python, Scala, and an R kernel for Jupyter notebooks. It also explains how to connect to Spark and how to expose the notebook over SSL.
https://developer.ibm.com/hadoop/blog/2016/05/04/install-jupyter-notebook-spark/
This post describes how the Mongo Hadoop connector can bridge Spark and MongoDB.
https://x.ai/using-the-mongo-hadoop-connector-as-a-translation-layer-to-spark/
The Qubole blog compares popular programming languages for big data analysis: Python, R, and Scala.
http://www.qubole.com/blog/big-data/programming-language/
Other News
MapR announced a record quarter, with 99% growth in software subscription licenses and a 146% dollar-based net expansion rate.
https://www.mapr.com/company/press-releases/mapr-achieves-another-record-quarter-99-software-subscription-license-growth
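This article describes a recent benchmark of Google Cloud Dataflow against Apache Spark on Google Compute Engine. Dataflow outperformed Spark by 2x to 5.7x (as always, it is best to evaluate your own workload in your own environment rather than simply trust a benchmark). The article also describes a dynamic that benefits everyone who uses big data tools.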
http://www.datanami.com/2016/05/02/dataflow-tops-spark-benchmark-test/
The Confluent blog looks back at the recent Kafka Summit, including the qualifying round of the programming challenge, the keynotes, and the breakout sessions.
http://www.confluent.io/blog/log-compaction-kafka-summit-edition-may-2016
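Forbes describes American Express's adoption of big data technology over the past five years. In the article, American Express shares tips and lessons learned, such as the difficulty of adopting new technology (and how important executive buy-in is) and the challenges of hiring and retaining engineers.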
http://www.forbes.com/sites/ciocentral/2016/04/27/inside-american-express-big-data-journey/
Releases
Cask released version 3.4 of the Cask Data Application Platform (CDAP). The new version adds Cask Tracker (a new dataset audit and search system), an upgraded UI for Cask Hydrator, enhanced Spark support, and more.
http://blog.cask.co/2016/05/announcing-cdap-release-3-4-introducing-tracker-next-gen-hydrator-enhanced-spark-support-and-much-more/
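Cloudera open sourced "dist_test", a new tool for running unit tests in parallel. With it, the unit tests for the Hadoop and Kudu projects finish in minutes rather than hours. The tool works with C++ and Java, and the post demonstrates these features.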
http://blog.cloudera.com/blog/2016/05/quality-assurance-at-cloudera-distributed-unit-testing/
Google announced that Google BigQuery now integrates with Drive, so query output can be saved to Google Sheets.
http://techcrunch.com/2016/05/06/google-connects-bigquery-to-google-drive-and-sheets/
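Events
China
None
Hadoop Weekly, Issue 168
Translated by the Venustech Platform and Big Data Group
May 1, 2016
Kafka Summit took place in San Francisco this week, so this issue unsurprisingly has a lot of Kafka content. There are also plenty of articles on Impala performance, Kudu, and Druid. In other news, Apache Apex became an Apache top-level project, and Qubole open sourced its StreamX project.
Technical News
This post takes a quick look at which Spark RDD operations keep the existing data partitioning and which may not. In particular, `mapValues` and `filter` preserve the partitioning, while `map` does not.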
https://medium.com/@corentinanjuna/apache-spark-rdd-partitioning-preservation-2187a93bc33e
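The reason behind this behavior can be shown with a toy model in plain Python. This is not PySpark; the helper names and the four-partition hash scheme are assumptions for illustration. The point is that an operation that cannot touch keys leaves every record in the partition its key hashes to, whereas an arbitrary `map` function may change keys, so the engine can no longer rely on the old layout.

```python
NUM_PARTITIONS = 4


def partition_of(key):
    """Hash partitioning: the partition a key belongs to."""
    return hash(key) % NUM_PARTITIONS


def partition_by_key(pairs):
    """Distribute (key, value) pairs into partitions by key hash."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        parts[partition_of(key)].append((key, value))
    return parts


def is_partitioned_by_key(parts):
    """True if every record sits in the partition its key hashes to."""
    return all(partition_of(k) == i for i, part in enumerate(parts) for k, _ in part)


def map_values(parts, fn):
    """Transform values only; keys are untouched."""
    return [[(k, fn(v)) for k, v in part] for part in parts]


def filter_pairs(parts, pred):
    """Drop records; surviving records keep their keys."""
    return [[kv for kv in part if pred(kv)] for part in parts]


def map_pairs(parts, fn):
    """Transform whole records; fn is free to return a different key."""
    return [[fn(kv) for kv in part] for part in parts]


parts = partition_by_key((i, i * i) for i in range(20))
```

In this model, applying `map_pairs` with a function that shifts each key by one leaves records in partitions that no longer match their keys, so a later key-based operation would have to redistribute the data.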
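This post shows how to use Conda to build a self-contained Python environment (for example, one that includes pandas) that can be shipped to cluster nodes as part of a Spark job. That makes it possible to run PySpark jobs without installing native Python packages on the host operating system. The same approach works for SparkR.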
http://quasiben.github.io/blog/2016/4/15/conda-spark/
The Datadog blog has a three-part series on monitoring Kafka. The first part details the key metrics for brokers, producers, consumers, and ZooKeeper. The second shows how to view metrics over JMX with JConsole and other tools, and the third covers the Datadog integration.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
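Salesforce describes how Kafka use has grown inside the organization. Kafka initially powered operational metrics analysis and gradually became a platform behind many systems. Salesforce runs Kafka in multiple data centers and uses MirrorMaker to replicate and aggregate data between clusters.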
https://medium.com/salesforce-engineering/expanding-visibility-with-apache-kafka-e305b12c4aba#.5k7j921o3
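The Metamarkets blog has an interesting post on optimizing a large distributed system. Druid, their distributed data store, recently gained a "first in, first out" query mode, which was tested on a large cluster under heavy load. The post covers their hypotheses, what actually happened, and the interesting metrics they collected.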
https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
The Google Cloud Big Data blog describes Capacitor, BigQuery's internal storage format, along with other optimizations that make data storage more efficient.
https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
The Apache Kudu (incubating) blog summarizes recent results from using the YCSB tool to analyze and tune the system's performance.
http://getkudu.io/2016/04/26/ycsb.html
Impala 2.5 delivers significant performance gains on TPC benchmarks and elsewhere. Improvements include runtime filters, LLVM code generation for `SORT` and `DECIMAL`, faster metadata-only queries, and more.
http://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/
This post describes how to configure MariaDB for the Hive Metastore to support high availability.
https://developer.ibm.com/hadoop/blog/2016/04/26/bigsql-ha-configure-ha-hive-metastore-db-using-mariadb10-1/
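The Altiscale blog describes the hunt for a NodeGroup-related bug (a follow-up to a post from March). If you have ever been discouraged by failing to find the root cause of a bug in Hadoop (or another distributed system), don't be. This post shows how hard it can be, even for engineers at a company that sells Hadoop as a service.
Netflix now runs more than 4,000 Kafka brokers across 36 clusters. Running Kafka in the cloud requires trade-offs, and the team has balanced cost against data loss (daily data loss is below 0.01%). The post shares the team's experience operating Kafka in AWS, including typical problems, deployment strategy (small clusters, isolated ZooKeeper clusters), cluster-level failover, support for AWS availability zones, Kafka UI visualization, and more.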
http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html
The AWS Big Data blog describes how to use Amazon EMR with encrypted data stored in S3. The integration supports both client-side and server-side encryption (with AWS KMS).
http://blogs.aws.amazon.com/bigdata/post/TxBQTAF3X7VLEP/Process-Encrypted-Data-in-Amazon-EMR-with-Amazon-S3-and-AWS-KMS
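TubeMogul describes the history of their big data platform, which supports trillions of data analysis requests per month. The team adopted Amazon EMR early on, introduced Storm for real-time processing, and ultimately settled on Qubole for their big data services.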
https://www.tubemogul.com/engineering/the-big-data-lifecycle-at-tubemogul/
Caffe, the deep learning framework, has been integrated with Spark as CaffeOnSpark. MapR describes how to run it on MapR YARN, and the post also covers the performance optimizations used.
https://www.mapr.com/blog/distributed-deep-learning-caffe-using-mapr-cluster
Other news
Apache Apex, a big data stream and batch processing system, is now a top-level project of the Apache Software Foundation. Apex entered the incubator in August of last year.
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90
Heroku Kafka is a managed Kafka service from Heroku. It is approaching a beta release.
https://blog.heroku.com/archives/2016/4/26/announcing-heroku-kafka-early-access
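A post on the MapR blog stresses why gender diversity matters and highlights the Women in Big Data Forum. It aims to encourage women to enter the field. A "Women in Big Data Forum" event, organized by MapR, takes place in San Jose this week.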
https://www.mapr.com/blog/case-women-big-data
Releases
StreamX is an open source project from Qubole that copies data from Kafka to target stores such as Amazon S3. Qubole also offers StreamX as a managed service.
http://www.qubole.com/blog/big-data/streamx/
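SnappyData is a new platform (and company) for OLAP and OLTP queries over streaming data. SnappyData is powered by Apache Spark and GemFire's in-memory storage technology.
Apache Geode (incubating) has released version 1.0.0-incubating.M2. Geode is a distributed data platform aimed at high performance and low latency. The new release adds features such as point-to-point connectivity over a WAN.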
http://mail-archives.apache.org/mod_mbox/incubator-geode-dev/201604.mbox/%3CCAFh%2B7k2eiK2TMGKsLqrY9CZDjxjYwiuTQ4QGUVC2s3geyJYwnA%40mail.gmail.com%3E
Apache Knox, a REST API gateway for Hadoop, has released version 0.9.0. The new release adds UI support for Ranger and Ambari, along with other improvements and bug fixes.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201604.mbox/%3CCACRbFyjRF7zShb-NQ29d3FJ0hKZ57ts0Qfo31ffuNODpskwqPQ@mail.gmail.com%3E
Events
China
None
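Welcome to a special Monday edition of Hadoop Weekly. This week has plenty of technical news from Spark, Kafka, Beam, and Kudu. If you are looking for something more cutting-edge, Apache Metron (incubating) has released its first version. Metron is an evolving general-purpose security system built on Hadoop.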
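Technical
This post describes how to build stream processing systems on AWS. It covers simple combinations such as Amazon Kinesis with AWS Lambda and the Kinesis S3 connector, as well as a relatively more complex setup for real-time analytics on AWS.
This post describes how to use Spark Testing Base, a Spark testing framework written in Scala, from Java. The sample code shows how to refactor Spark code so that logic can be tested in isolation, and how to deal with some awkward Scala APIs from Java.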
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
The Altiscale blog outlines the pros and cons of building thin versus uber JARs for Spark. It demonstrates how to build both kinds with Maven and SBT.
https://www.altiscale.com/blog/spark-on-hadoop-thin-jars/
LinkedIn describes their Kafka ecosystem, which includes a special Kafka producer, a REST API for non-Java clients, an Avro schema registry, Gobblin (a tool for loading data into Hadoop), and more.
https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
This Spark Streaming tutorial shows how to pull tweets with the twitter4j API, filter them by hashtag, and run sentiment analysis on them.
https://www.mapr.com/blog/spark-streaming-and-twitter-sentiment-analysis
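Apache Kudu (incubating) is a great companion to Apache Impala (incubating), since it efficiently serves both broad analytic queries and targeted queries. This post describes the technical details of the integration, such as how Kudu's design enables efficient queries and how to perform insert, update, and delete operations on Kudu through Impala.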
http://blog.cloudera.com/blog/2016/04/how-to-use-impala-and-kudu-together-for-analytic-workloads/
MapR describes how to use spark-sklearn to scale an existing scikit-learn model. The post walks through building a model on an Airbnb data set and then shows how to run cross-validation with spark-sklearn.
https://www.mapr.com/blog/predicting-airbnb-listing-prices-scikit-learn-and-apache-spark
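The AWS Big Data blog has a tutorial on using HBase and Hive on Amazon EMR. The tutorial introduces HBase, describes how to restore an HBase table from S3, shows how Hive and HBase integrate, and more.
This post describes the challenge of giving students hands-on experience in a big data course. After several iterations and options, the author found a good solution in Altiscale's Hadoop-as-a-Service.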
https://www.altiscale.com/blog/hadoop-as-a-service-in-the-classroom/
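A guest post on the Cloudera blog compares Parquet and Avro across two data sets, one narrow (3 columns) and one wide (103 columns). After testing queries and operations with Spark and Spark SQL, the author found that Parquet and Avro sometimes perform similarly when querying serialized data, although in most cases queries on Parquet data are faster and the serialized data is smaller.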
http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/
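This post describes how to use SparkR in a distributed environment such as CDH, even though SparkR is not officially supported there. With YARN and R installed locally on the workers, jobs run with only minor modifications.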
http://www.nodalpoint.com/sparkr-in-cloudera-hadoop/
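Many open source frameworks can do MapReduce-like work through higher-level programming models. Historically, they have relied on a single execution framework (for example MapReduce or Storm), but recent changes have made the picture less certain. Apache Beam (incubating) goes a step further by spanning both batch and streaming execution modes, with a more sophisticated computation model built in.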
http://www.datanami.com/2016/04/22/apache-beam-emerges-ambitious-goal-unify-big-data-development/
The Apache blog has published a seven-part series comparing HBase write performance on HDD, SSD, and RAMDISK. Through this analysis, the authors found and proposed several features that are not yet implemented in HBase and HDFS.
https://blogs.apache.org/hbase/entry/hdfs_hsm_and_hbase_part
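Other news
Tom White, author of "Hadoop: The Definitive Guide", writes about how he got involved with Apache Hadoop. His early contributions centered on integrating Hadoop with Amazon Web Services, and AWS has since become an important part of the Hadoop project's success.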
http://vision.cloudera.com/how-i-got-into-hadoop/
Fluo, a distributed processing engine for Apache Accumulo, has submitted a proposal to the Apache Incubator.
https://wiki.apache.org/incubator/FluoProposal
Apache Phoenix, a SQL-on-HBase system, has announced a conference to be held after HBaseCon. The half-day event focuses on Phoenix internals and use cases.
http://hortonworks.com/blog/announcing-first-annual-phoenixcon-apache-phoenix-user-conference/
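Releases
Apache Metron, a security framework built on Hadoop, has released version 0.1. Hortonworks supports it as a technical preview and has written posts describing how to get started, how to contribute, how to use the Metron UI, and more.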
http://hortonworks.com/blog/apache-metron-tech-preview-1-come-get/
http://hortonworks.com/blog/apache-metron-use-case-finding-needle-haystack/
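Apache NiFi released version 0.6.1 this week. It is a bug-fix release that resolves more than 10 bugs.
Apache Flink released version 1.0.2 this week. The release includes bug fixes, performance improvements with RocksDB, and some documentation improvements.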
http://flink.apache.org/news/2016/04/22/release-1.0.2.html
Amazon released a new version of Amazon EMR, which adds support for HBase 1.2.
https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-hbase-1-2-is-now-available/
Events
China
None
April 17, 2016
Compiled and translated by the Venustech Platform and Big Data team
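Hortonworks made several announcements at Hadoop Summit Europe this week, and they run throughout this issue. Apache Storm released version 1.0.0 with impressive new features. On the technical side, there are a number of articles on building large-scale services with Kafka and on testing distributed systems. If you missed Hadoop Summit, don't worry: videos of the talks are already online.
Technical
Smyte describes their infrastructure for detecting spam and fraud in real time from event data streams. Their initial event processing system was built on Kafka, Redis, Secor, and S3. To keep up with growing scale and keep costs down, they migrated to a disk-based solution that speaks the Redis protocol, stores data in RocksDB, and replicates via Kafka.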
https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da
This post combines rsyslog, Kafka, and AWS with the ELK stack (ElasticSearch, Logstash, Kibana) to address issues such as backpressure, scale, and maintenance. It covers tips on integrating rsyslog with Kafka and on schemas, and describes how to run Kafka and ZooKeeper in auto scaling groups on AWS.
https://www.bashton.com/blog/2016/elk-on-ark/
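Hortonworks describes the data governance features coming to Apache Atlas and Apache Ranger. These are: classification-based access control, data expiration policies, location-specific policies, prohibitions on combining data sets, and cross-component lineage (for example, tracking data from Kafka to Storm to Hive).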
http://hortonworks.com/blog/the-next-generation-of-hadoop-based-security-data-governance/
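Apache HAWQ (incubating) is a SQL engine, based on Greenplum, for querying data in HDFS. This post discusses its typical design and the many improvements in the new version. It covers how HAWQ differs from Spark and MapReduce, how Hadoop challenges the classic MPP design, and how HAWQ's new design combines MPP and batch processing to get the benefits of both.
The Cloudera blog describes AgenTEST, a testing tool for fault injection and network partitioning in Hadoop distributed systems. It can inject network faults (such as packet loss), resource exhaustion (such as CPU, IO, and disk space), and more. When testing network partitions, configurations such as ring partitions and bridge partitions can be evaluated.
The Hortonworks blog previews HDP 2.4.2, which will include new versions of Spark and Zeppelin. A Spark 2.0 preview and new Zeppelin features are both included.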
http://hortonworks.com/blog/apache-spark-apache-zeppelin-whats-coming-in-hdp-2-4-2/
Cask describes how they use long-running tests to evaluate the correctness of a distributed system before and after rare events such as HBase region compaction.
http://blog.cask.co/2016/04/long-running-tests-in-cdap/
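This post describes how to combine SparkR with Amazon EMR for geospatial analysis. With SparkR's Hive integration, data on S3 can be mapped directly to Hive external tables. From there, the data can be loaded into memory and analyzed in R, which makes it easy to produce high-quality visualizations.
MapR has written a tutorial on analyzing Major League Baseball team statistics with Pig and Hive. Pig is used for initial data preparation, and Hive provides a SQL-based query environment. With the Hive ODBC driver and the Hive server, Microsoft Excel can also be used to retrieve and analyze the data.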
https://www.mapr.com/blog/using-hive-and-pig-baseball-statistics
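SignalFx processes more than 70 billion messages per day through a 27-node Kafka cluster. Drawing on their experience running Kafka at this scale, they share a number of tips on tuning Kafka, on alerting (for example, on increases in log flush latency), and on scaling Kafka out.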
http://www.confluent.io/blog/how-we-monitor-and-run-kafka-at-scale-signalfx
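The data Artisans blog has a post evaluating Flink on streaming efficiency, low latency, and correctness. To demonstrate efficiency, it runs the recent Yahoo! streaming benchmark in a high-throughput setting. On correctness, the post highlights Flink's strengths in distinguishing event time from processing time (using an analogy to the Star Wars movie timeline). Finally, the post describes in-memory query capabilities planned for future versions of Flink.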
http://data-artisans.com/counting-in-streams-a-hierarchy-of-needs/
This tutorial shows how to build a custom Spark streaming source that reads text data from a TCP socket.
https://medium.com/@anicolaspp/spark-custom-streaming-sources-e7d52da72e80
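This post describes how to avoid accidentally committing AWS credentials to patches or git repositories when building Hadoop. Beyond Hadoop itself, the post recommends the "git-secrets" tool to prevent accidental commits of access and secret keys. If you use Hadoop's S3 support, it also points to new patches to evaluate.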
http://steveloughran.blogspot.co.uk/2016/04/testing-against-s3-and-object-stores.html
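The kind of check this item recommends can be sketched in a few lines. This is a minimal illustration, not how git-secrets itself is implemented: the function name `find_suspect_lines` is hypothetical, and the pattern assumes the common shape of an AWS access key ID (the prefix `AKIA` followed by 16 uppercase letters or digits). Other credential formats would need more patterns.

```python
import re
from typing import List

# Assumption: access key IDs look like "AKIA" + 16 uppercase letters/digits.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_suspect_lines(text: str) -> List[int]:
    """Return the 1-based line numbers that appear to contain an access key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ACCESS_KEY_ID.search(line):
            hits.append(lineno)
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = (
    "fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE\n"
    "fs.s3a.endpoint=s3.amazonaws.com\n"
)
suspects = find_suspect_lines(sample)
```

A check like this could run over a patch or staged diff before it is shared, rejecting the change whenever the returned list is non-empty.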
Big Data & Brews interviewed MapR's Ted Dunning and Jacques Nadeau. Apache Arrow is among the topics covered.
https://www.youtube.com/watch?v=l3mDDKjDjMk
https://www.youtube.com/watch?v=Xo9CO0a0VJI
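Other news
DataEngConf recently took place in San Francisco. This post summarizes talks from Uber, Stripe, Microsoft, Instacart, and Jawbone. It also describes the conference theme, "data science in the real world is a product and engineering discipline".
Hortonworks made a splash at Hadoop Summit Europe in Dublin last week. ZDNet reports on the highlights, which include an expanded partnership with Pivotal (which now resells HDP), a reseller agreement with Syncsort, and technical previews of Atlas, Ranger, Zeppelin, and Metron. The report also describes how the offerings from Hortonworks, Cloudera, and MapR differ.
Flink Forward 2016 will be held in Berlin, Germany in September. The call for papers closes at the end of June.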
http://flink.apache.org/news/2016/04/14/flink-forward-announce.html
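Videos of the talks from Hadoop Summit Dublin have been posted on YouTube. As expected, the talks cover every part of the Hadoop ecosystem.
Releases
Metascope is a new tool for metadata management in Hadoop clusters that works together with Schedoscope. Through its web interface, it provides insight into large amounts of data via data lineage. It also offers search, inline documentation, a REST API, and more.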
https://github.com/ottogroup/metascope
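Apache HBase 1.2.1 was released this week, resolving 27 issues on top of 1.2.0. The release announcement highlights four high-priority issues.
The Apache Mahout machine learning library has released version 0.12.0. In this release, the "Samsara" math environment adds support for Apache Flink and is platform independent. The release announcement shares details on the Flink integration, known issues, and the project roadmap.
Apache Storm 1.0.0 was released this week. Highlights include performance improvements (generally 3x or more), a new distributed cache API, high availability for Nimbus, automatic backpressure, dynamic worker profiling, and more.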
http://storm.apache.org/2016/04/12/storm100-released.html
Apache Kudu (incubating) released version 0.8.0 this week. The release adds an Apache Flume sink, several improvements, and a number of bug fixes.
http://getkudu.io/releases/0.8.0/docs/release_notes.html
Cloudbreak, which provisions Docker-based Hadoop clusters in cloud environments, released version 1.2 this week. New features include support for OpenStack and custom scripts for provisioning servers.
http://hortonworks.com/blog/announcing-cloudbreak-1-2/
Cloudera released Cloudera Enterprise 5.4.10, which includes components such as Flume, Hadoop, HBase, Hive, and Impala.
Presto Accumulo is a new project that provides a Presto connector for reading data from Accumulo.
https://github.com/bloomberg/presto-accumulo
Events
China
None
Issue 165, April 10, 2016
Compiled and translated by the Venustech Platform and Big Data team</strong>
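This week saw major releases across community projects, including new open source projects from LinkedIn and Airbnb. The technical section of this issue focuses on stream processing (Spark, Flink, Kafka, and others), and the news section covers the agendas for Spark Summit and HBaseCon.
Zalando has written about how they chose Apache Flink as their stream processing framework. The post explains the conclusions they reached after validating their evaluation criteria and gives the main reasons for choosing Apache Flink: low latency even at high throughput, true stream processing, and developer support.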
https://tech.zalando.com/blog/apache-showdown-flink-vs.-spark/
The Cloudera blog has a post from Wargaming.net describing how they built a real-time processing infrastructure with Kafka, HBase, Drools, and Spark. On the data flow side, they describe optimizations to HBase reads and serialization, data locality between HBase and Spark, and Spark computation.
http://blog.cloudera.com/blog/2016/04/inside-wargamings-data-driven-real-time-rules-engine/
InfoQ has posted a video introducing the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) for stream processing at scale. It discusses why the SMACK stack is simpler than the Lambda architecture for the same problems.
http://www.infoq.com/presentations/stream-analytics-scalability
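Confluent's "Log Compaction" blog series has a new installment covering what happened in the Kafka project in March. There are a number of exciting developments, including rack awareness, Kerberos support, and progress on time-based indexes. It is a good way to catch up on recent work if you (like me) don't have time to follow the project continuously.
Apache Flink 1.0 introduced a new complex event processing (CEP) library. In short, CEP provides a way to detect patterns in event streams. Using an anomaly detection use case based on sensor data collected from data center servers, this post illustrates Flink's CEP pattern API.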
http://flink.apache.org/news/2016/04/06/cep-monitoring.html
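To make "detecting patterns in an event stream" concrete, here is a toy sketch in plain Python. It does not use Flink or its CEP API. The rule is an assumption chosen for illustration: flag a rack when two consecutive readings from it are at or above a threshold and arrive within a time window. The names `Reading` and `detect_warnings`, and the default threshold and window, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Reading:
    rack_id: int
    timestamp: int      # seconds, assumed to arrive in time order
    temperature: float


def detect_warnings(
    readings: Iterable[Reading],
    threshold: float = 100.0,
    within: int = 10,
) -> List[Tuple[int, int, int]]:
    """Return (rack_id, first_ts, second_ts) for each matched pair of hot readings."""
    last_hot: Dict[int, Reading] = {}
    warnings = []
    for r in readings:
        if r.temperature >= threshold:
            prev = last_hot.get(r.rack_id)
            if prev is not None and r.timestamp - prev.timestamp <= within:
                warnings.append((r.rack_id, prev.timestamp, r.timestamp))
            last_hot[r.rack_id] = r
        else:
            # A cool reading breaks the "consecutive" requirement for that rack.
            last_hot.pop(r.rack_id, None)
    return warnings


events = [
    Reading(1, 0, 101.0),
    Reading(2, 1, 99.0),
    Reading(1, 5, 105.0),   # second hot reading on rack 1 within 10 s
    Reading(1, 6, 90.0),    # resets rack 1
    Reading(1, 7, 110.0),
    Reading(2, 30, 120.0),
    Reading(2, 50, 130.0),  # hot, but 20 s after the previous one
]
matches = detect_warnings(events)
```

A CEP library expresses rules like this declaratively and handles state, ordering, and time for you, whereas this sketch keeps only a per-rack dictionary in memory.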
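The Genome Analysis Toolkit (GATK) recently announced that its next version (currently in alpha) will support Apache Spark. This post briefly introduces the tool and shows how to use Spark to detect duplicate DNA fragments.
InfoWorld covers the plans for structured streaming in Spark 2.0. Micro-batching will remain, along with new features such as Infinite DataFrames and support for repeated queries.
The AWS Big Data blog has a post on loading data into S3 and Redshift using encryption keys stored in the AWS Key Management Service (KMS). Besides describing the required steps, the post explains how to encrypt data in AWS S3 with KMS keys.
The Confluent blog describes how to write a non-trivial "hello world" program with Kafka Connect and Kafka Streams. More precisely, the example program pulls Wikipedia data from IRC, parses the messages, and computes several statistics. The post includes several code listings that show the full implementation.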
http://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams
This post converts a simple schema from Postgres to Cassandra and describes the main differences: replication, data types (Cassandra does not support JSON), primary keys, and eventual consistency.
http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/
The ESG blog reports on the recent Strata+Hadoop World conference and highlights several themes, such as Spark's strong momentum, machine learning, and cloud services.
http://blog.esg-global.com/riding-high-at-stratahadoop-world
InformationWeek also covers Strata, with a focus on MapR, Pivotal, artificial intelligence, and more.
The agenda for Spark Summit 2016 has been finalized. It takes place June 6-8 in San Francisco, with two days of sessions across five tracks.
https://databricks.com/blog/2016/04/04/agenda-announced-for-sparksummit-2016-in-san-francisco.html
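Forbes interviewed Cloudera CEO Tom Reilly, who discussed the company's opportunities, the competitive market, IPO plans, and more.
Datanami writes about the rise of Apache Kafka as a backbone for stream processing. The article also interviews Confluent co-founder and CTO Neha Narkhede, who talks about the upcoming Kafka Connect and Kafka Streams.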
http://www.datanami.com/2016/04/06/real-time-rise-apache-kafka/
HBaseCon takes place May 24 in San Francisco, and the agenda was recently announced. There are more than 20 talks across three tracks.
http://blog.cloudera.com/blog/2016/04/hbasecon-2016-speaker-lineup-announced/
Apache HBase 0.98.18 and 1.1.4 were both released recently. 1.1.4 contains a number of fixes, including nine for stability or correctness. HBase 0.98.18 resolves 50 issues (bugs, improvements, and two new features).
http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCANZa%3DGu-mAxKEtfoRjctHcE0KD7z52oE010Fgsf6AMmW2tDZLA%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCA%2BRK%3D_CtZ1L07nS6Og2ekfVwet0qTE7jw-bmyD2pp5UPweUehQ%40mail.gmail.com%3E
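Apache Lens, a unified analytics interface that supports multiple execution engines and data stores in the Hadoop ecosystem, has released 2.5.0-beta. The release resolves 87 issues, mostly bug fixes and new features.
Airbnb has open-sourced Caravel, a data exploration system (data visualization platform). Caravel supports many features usually seen only in commercial products and can connect to any system that speaks a SQL dialect. In particular, it supports real-time analytics on Druid.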
https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5
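MapR has announced support for Apache Drill 1.6 in their distribution. Highlights of the release include a new MapR-DB storage plugin, new SQL window function support, and end-to-end security. The announcement includes examples of using the MapR-DB API to load data and query it with Drill.
Apache Flink has released 1.0.1, a bug-fix release in the 1.0.x line. It resolves 23 issues and is a recommended upgrade for all 1.0.0 users.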
http://flink.apache.org/news/2016/04/06/release-1.0.1.html
Cloudera Enterprise 5.7 has been released, with version upgrades for components such as Spark, HBase, Impala, and Kafka. Highlights include Hive-on-Spark and the HBase-Spark integration, both coming out of Cloudera Labs, significant Impala performance improvements, and support for placing the HBase WAL on SSD.
http://blog.cloudera.com/blog/2016/04/cloudera-enterprise-5-7-is-released/
Apache Tajo, a data warehouse system built on Hadoop, has released version 0.11.2. The new version adds Kerberos support, fixes for ORC tables with Hive, and more.
http://tajo.apache.org/releases/0.11.2/announcement.html
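LinkedIn has open-sourced Dr. Elephant, a tool for diagnosing performance problems in Hadoop and Spark jobs. Using metrics about completed jobs collected from the YARN resource manager, Dr. Elephant evaluates them and generates a diagnostic report covering issues such as data skew and GC overhead. LinkedIn claims it helps resolve 80% of issues.
China
None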