<rt id="bn8ez"></rt>
<label id="bn8ez"></label>

  • <span id="bn8ez"></span>

    <label id="bn8ez"><meter id="bn8ez"></meter></label>

    gembin

    OSGi, Eclipse Equinox, ECF, Virgo, Gemini, Apache Felix, Karaf, Aires, Camel, Eclipse RCP

    HBase, Hadoop, ZooKeeper, Cassandra

    Flex4, AS3, Swiz framework, GraniteDS, BlazeDS etc.

    There is nothing that software can't fix. Unfortunately, there is also nothing that software can't completely fuck up. That gap is called talent.

    About Me

     

    HBase vs. Cassandra: NoSQL Battle! [轉]

    from http://www.roadtofailure.com/2009/10/29/hbase-vs-cassandra-nosql-battle/

    (Note: Check out the Drawn to Scale platform. Store, query, search, process, and serve *all* your data. To *all* your users. In real time).

    Distributed, scalable databases are desperately needed these days. From building massive data warehouses at a social media startup, to protein folding analysis at a biotech company, “Big Data” is becoming more important every day. While Hadoop has emerged as the de facto standard for handling big data problems, there are still quite a few distributed databases out there and each has their unique strengths.

    Two databases have garnered the most attention: HBase and Cassandra. The split between these equally ambitious projects can be categorized into Features (things missing that could be added any at time), and Architecture (fundamental differences that can’t be coded away). HBase is a near-clone of Google’s BigTable, whereas Cassandra purports to being a “BigTable/Dynamo hybrid”.

    In my opinion, while Cassandra’s “writes-never-fail” emphasis has its advantages, HBase is the more robust database for a majority of use-cases. Cassandra relies mostly on Key-Value pairs for storage, with a table-like structure added to make more robust data structures possible. And it’s a fact that far more people are using HBase than Cassandra at this moment, despite both being similarly recent.

    Let’s explore the differences between the two in more detail…

    CAP and You

    This article at Streamy explains CAP theorem (Consistency, Availability, Partitioning) and how the BigTable-derived HBase and the Dynamo-derived Cassandra differ.

    Before we go any further, let’s break it down as simply as possible:

    • Consistency: “Is the data I’m looking at now the same if I look at it somewhere else?”
    • Availability: “What happens if my database goes down?”
    • Partitioning: What if my data is on different networks?”

    CAP posits that distributed systems have to compromise on each, and HBase values strong consistency and High Availability while Cassandra values Availability and Partitioning tolerance. Replication is one way of dealing with some of the design tradeoffs. HBase does not have replication yet, but that’s about to change — and Cassandra’s replication comes with some caveats and penalties.

    Let’s go over some comparisons between these two datastores:

    Feature Comparisons

    Processing

    HBase is part of the Hadoop ecosystems, so many useful distributed processing frameworks support it: Pig, Cascading, Hive, etc. This makes it easy to do complex data analytics without resorting to hand-coding. Efficiently running MapReduce on Cassandra, on the other hand, is difficult because all of its keys are in one big “space”, so the MapReduce framework doesn’t know how to split and divide the data natively. There needs to be some hackery in place to handle all of that.

    In fact, here’s some code from a Cassandra/Hadoop Integration patch:

    + /*
    +  FIXME This is basically a huge kludge because we needed access to
    + cassandra internals, and needed access to hadoop internals and so we
    + have to boot cassandra when we run hadoop. This is all pretty
    + fucking awful.
    +
    +  P.S. it does not boot the thrift interface.
    + */

    This gives me The Fear.

    Bottom line? Cassandra may be useful for storage, but not any data processing. HBase is much handier for that.

    Installation & Ease of Use

    Cassandra is only a Ruby gem install away. That’s pretty impressive. You still have to do quite a bit of manual configuration, however. HBase is a .tar (or packaged by Cloudera) that you need to install and setup on your own. HBase has thorough documentation, though, making the process a little more straightforward than it could’ve been.

    HBase ships with a very nice Ruby shell that makes it easy to create and modify databases, set and retrieve data, and so on. We use it constantly to test our code. Cassandra does not have a shell at all — just a basic API. HBase also has a nice web-based UI that you can use to view cluster status, determine which nodes store various data, and do some other basic operations. Cassandra lacks this web UI as well as a shell, making it harder to operate. (ed: Apparently, there is now a shell and pretty basic UI — I just couldn’t find ‘em).

    Overall Cassandra wins on installation, but lags on usability.

    Architecture

    The fundamental divergence of ideas and architecture behind Cassandra and HBase drives much of the controversy over which is better.

    Off the bat, Cassandra claims that “writes never fail”, whereas in HBase, if a region server is down, writes will be blocked for affected data until the data is redistributed. This rarely happens in practice, of course, but will happen in a large enough cluster. In addition, HBase has a single point-of-failure (the Hadoop NameNode), but that will be less of an issue as Hadoop evolves. HBase does have row locking, however, which Cassandra does not.

    Apps usually rely on data being accurate and unchanged from the time of access, so the idea of eventual consistency can be a problem. Cassandra, however, has an internal method of resolving up-to-dateness issues with vector clocks — a complex but workable solution where basically the latest timestamp wins. The HBase/BigTable puts the impetus of resolving any consistency conflicts on the application, as everything is stored versioned by timestamp.

    Another architectural quibble is that Cassandra only supports one table per install. That means you can’t denormalize and duplicate your data to make it more usable in analytical scenarios. (edit: this was corrected in the latest release) Cassandra is really more of a Key Value store than a Data Warehouse. Furthermore, schema changes require a cluster restart(!). Here’s what the Cassandra JIRA says to do for a schema change:

    1. Kill Cassandra
    2. Start it again and wait for log replay to finish
    3. Kill Cassandra AGAIN
    4. Make your edits (now there is no data in the commitlog)
    5. Manually remove the sstable files (-Data.db, -Index.db, and -Filter.db) for the CFs you removed, and rename files for CFs you renamed
    6. Start Cassandra and your edits should take effect

    With the lack of timestamp versioning, eventual consistency, no regions (making things like MapReduce difficult), and only one table per install, it’s difficult to claim that Cassandra implements the BigTable model.

    Replication

    Cassandra is optimized for small datacenters (hundreds of nodes) connected by very fast fiber. It’s part of Dynamo’s legacy from Amazon. HBase, being based on research originally published by Google, is happy to handle replication to thousands of planet-strewn nodes across the ’slow’, unpredictable Internet.

    A major difference between the two projects is their approach to replication and multiple datacenters. Cassandra uses a P2P sharing model, whereas HBase (the upcoming version) employs more of a data+logs backup method, aka ‘log shipping’. Each has a certain elegance. Rather than explain this in words, here comes the drawings:

    This first diagram is a model of the Cassandra replication scheme.

    1. The value is written to the “Coordinator” node
    2. A duplicate value is written to another node in the same cluster
    3. A third and fourth value are written from the Coordinator to another cluster across the high-speed fiber
    4. A fifth and sixth value are written from the Coordinator to a third cluster across the fiber
    5. Any conflicts are resolved in the cluster by examining timestamps and determining the “best” value.

    The major problem with this scheme is that there is no real-world auditability. The nodes are eventually consistent — if a datacenter (“DC”) fails, it’s impossible to tell when the required number of replicas will be up-to-date. This can be extremely painful in a live situation — when one of your DCs goes down, you often want to know *exactly* when to expect data consistency so that recovery operations can go ahead smoothly.

    It’s important to note that Cassandra relies on high-speed fiber between datacenters. If your writes are taking 1 or 2 ms, that’s fine. But when a DC goes out and you have to revert to a secondary one in China instead of 20 miles away, the incredible latency will lead to write timeouts and highly inconsistent data.

    Let’s take a look at the HBase replication model (note: this is coming in the .21 release):

    What’s going on here:

    1. The data is written to the HBase write-ahead-log in RAM, then it is then flushed to disk
    2. The file on disk is automatically replicated due to the Hadoop Filesystem’s nature
    3. The data enters a “Replication Log”, where it is piped to another Data Center.

    With HBase/Hadoop’s deliberate sequence of events, consistency within a datacenter is high. There is usually only one piece of data around the same time period. If there are not, then HBase’s timestamps allow your code to figure out which version is the “correct” one, instead of it being chosen by the cluster. Due to the nature of the Replication Log, one can always tell the state of the data consistency at any time — a valuable tool to have when another data center goes down. In addition, using this structure makes it easy to recover from high-latency scenarios that can occur with inter-continental data transfer.

    Knowing Which To Choose

    The business context of Amazon and Google explains the emphasis on different functionality between Cassandra and HBase.

    Cassandra expects High Speed Network Links between data centers. This is an artifact of Amazon’s Dynamo: Amazon datacenters were historically located very close to each other (dozens of miles apart) with very fast fiber optic cables between them. Google, however, had transcontinental datacenters which were connected only by the standard Internet, which means they needed a more reliable replication mechanism than the P2P eventual consistency.

    If you need highly available writes with only eventual consistency, then Cassandra is a viable candidate for now. However, many apps are not happy with eventual consistency, and it is still lacking many features. Furthermore, even if writes do not fail, there is still cluster downtime associated with even minor schema changes. HBase is more focused on reads, but can handle very high read and write throughput. It’s much more Data Warehouse ready, in addition to serving millions of requests per second. The HBase integration with MapReduce makes it valuable, and versatile.

    posted on 2010-03-30 13:01 gembin 閱讀(1311) 評論(0)  編輯  收藏 所屬分類: hbase

    導航

    統計

    常用鏈接

    留言簿(6)

    隨筆分類(440)

    隨筆檔案(378)

    文章檔案(6)

    新聞檔案(1)

    相冊

    收藏夾(9)

    Adobe

    Android

    AS3

    Blog-Links

    Build

    Design Pattern

    Eclipse

    Favorite Links

    Flickr

    Game Dev

    HBase

    Identity Management

    IT resources

    JEE

    Language

    OpenID

    OSGi

    SOA

    Version Control

    最新隨筆

    搜索

    積分與排名

    最新評論

    閱讀排行榜

    評論排行榜

    free counters
    主站蜘蛛池模板: 免费在线观看中文字幕| 亚洲av无码天堂一区二区三区| 添bbb免费观看高清视频| 久久亚洲AV午夜福利精品一区 | 免费一级毛片正在播放| 成人毛片免费观看视频在线| 99视频在线免费看| 成人A毛片免费观看网站| 特级毛片A级毛片100免费播放| 亚洲一卡2卡3卡4卡国产网站| 亚洲精品午夜视频| 国产产在线精品亚洲AAVV| 亚洲人AV在线无码影院观看| 亚洲最大的成人网| 亚洲av无码专区亚洲av不卡| 亚洲AV永久无码精品一福利| 亚洲人成色在线观看| WWW亚洲色大成网络.COM| 黄色三级三级三级免费看| 午夜亚洲乱码伦小说区69堂| 羞羞视频免费观看| 9i9精品国产免费久久| 亚欧日韩毛片在线看免费网站| **aaaaa毛片免费同男同女| 国产裸模视频免费区无码| 亚洲VA综合VA国产产VA中| 亚洲处破女AV日韩精品| 亚洲欧美一区二区三区日产| 亚洲日韩在线中文字幕综合| 免费无码又爽又刺激网站直播| 中文字幕在线免费| 狠狠躁狠狠爱免费视频无码| 在线免费观看国产| 色播在线永久免费视频| 国产成人精品日本亚洲专区61| 亚洲精品成人网站在线播放| 国产午夜亚洲精品不卡电影| 毛片无码免费无码播放| 四虎在线播放免费永久视频 | 亚洲国产av一区二区三区| 91亚洲国产在人线播放午夜|