
    cuiyi's blog (崔毅 crazycy)

    Recording the little things, weighing past gains and losses, to inform future growth

    NoSQL Study (5): Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison

    Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison

    http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis/
    Kristof Kovacs
    Software architect, consultant

    (Yes it's a long title, since people kept asking me to write about this and that too :) I do when it has a point.)

    While SQL databases are insanely useful tools, their monopoly in the last decades is coming to an end. And it's just time: I can't even count the things that were forced into relational databases, but never really fitted them. (That being said, relational databases will always be the best for the stuff that has relations.)

    But the differences between NoSQL databases are much bigger than there ever were between one SQL database and another. This means that software architects carry a bigger responsibility to choose the appropriate one for a project right at the beginning.

    In this light, here is a comparison of Cassandra, MongoDB, CouchDB, Redis, Riak, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, Neo4j and HBase:

    The most popular ones

    MongoDB (2.2)

    ·         Written in: C++

    ·         Main point: Retains some friendly properties of SQL. (Query, index)

    ·         License: AGPL (Drivers: Apache)

    ·         Protocol: Custom, binary (BSON)

    ·         Master/slave replication (auto failover with replica sets)

    ·         Sharding built-in

    ·         Queries are javascript expressions

    ·         Run arbitrary javascript functions server-side

    ·         Better update-in-place than CouchDB

    ·         Uses memory mapped files for data storage

    ·         Performance over features

    ·         Journaling (with --journal) is best turned on

    ·         On 32bit systems, limited to ~2.5Gb

    ·         An empty database takes up 192Mb

    ·         GridFS to store big data + metadata (not actually an FS)

    ·         Has geospatial indexing

    ·         Data center aware

    Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.

    For example: For most things that you would do with MySQL or PostgreSQL, but where having predefined columns really holds you back.
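The "dynamic queries" bullet above is MongoDB's key property: queries are themselves documents, built at runtime. As a rough illustration (not MongoDB's actual server-side evaluator), here is a tiny pure-Python matcher for Mongo-style query documents; the operator names `$gt`, `$lt` and `$in` are real MongoDB operators, everything else is a toy:

```python
# Toy matcher for MongoDB-style query documents, e.g. {"age": {"$gt": 18}}.
# Real MongoDB evaluates these server-side over BSON; this only shows the idea.

def matches(doc, query):
    """Return True if `doc` satisfies the Mongo-style query document."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 5}
            for op, operand in cond.items():
                if op == "$gt":
                    ok = value is not None and value > operand
                elif op == "$lt":
                    ok = value is not None and value < operand
                elif op == "$in":
                    ok = value in operand
                else:
                    ok = True  # unknown operators ignored in this sketch
                if not ok:
                    return False
        elif value != cond:  # plain equality match
            return False
    return True

users = [
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 17},
]
adults = [u for u in users if matches(u, {"age": {"$gt": 18}})]
```

Because the query is just data, an application can assemble it on the fly from user input, which is exactly what "dynamic queries" buys you over predefined map/reduce views.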

    Riak (V1.2)

    ·         Written in: Erlang & C, some JavaScript

    ·         Main point: Fault tolerance

    ·         License: Apache

    ·         Protocol: HTTP/REST or custom binary

    ·         Stores blobs

    ·         Tunable trade-offs for distribution and replication

    ·         Pre- and post-commit hooks in JavaScript or Erlang, for validation and security.

    ·         Map/reduce in JavaScript or Erlang

    ·         Links & link walking: use it as a graph database

    ·         Secondary indices: but only one at once

    ·         Large object support (Luwak)

    ·         Comes in "open source" and "enterprise" editions

    ·         Full-text search, indexing, querying with Riak Search

    ·         In the process of migrating the storage backend from "Bitcask" to Google's "LevelDB"

    ·         Masterless multi-site replication and SNMP monitoring are commercially licensed

    Best used: If you want Dynamo-like data storage, but don't want to deal with its bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.

    For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.
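Riak's "tunable trade-offs for distribution and replication" means choosing Dynamo-style N/R/W values per request. A minimal sketch of the arithmetic behind that tuning (the function names are illustrative, not Riak's API):

```python
# Dynamo-style quorum arithmetic: N replicas per key, a read must hear from
# R of them, a write must be acknowledged by W of them.

def is_strongly_consistent(n, r, w):
    """Read and write quorums overlap, so a read sees the newest write,
    whenever R + W > N."""
    return r + w > n

def writes_survive_failures(n, w):
    """Number of replica failures a write can tolerate and still succeed."""
    return n - w
```

With the common default N=3, choosing R=2, W=2 gives overlapping quorums (consistent reads) while still tolerating one node failure on either path; R=1, W=1 is faster and more available but can return stale data.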

    CouchDB (V1.2)

    ·         Written in: Erlang

    ·         Main point: DB consistency, ease of use

    ·         License: Apache

    ·         Protocol: HTTP/REST

    ·         Bi-directional (!) replication, continuous or ad-hoc, with conflict detection; thus, master-master replication. (!)

    ·         MVCC - write operations do not block reads

    ·         Previous versions of documents are available

    ·         Crash-only (reliable) design

    ·         Needs compacting from time to time

    ·         Views: embedded map/reduce

    ·         Formatting views: lists & shows

    ·         Server-side document validation possible

    ·         Authentication possible

    ·         Real-time updates via '_changes' (!)

    ·         Attachment handling

    ·         thus, CouchApps (standalone js apps)

    Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.

    For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
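The MVCC bullet above is worth a sketch: in CouchDB, every update must present the `_rev` it read, and a stale revision is rejected (HTTP 409) instead of silently overwriting. This toy in-memory store mimics that protocol only; it is not CouchDB's implementation (real revisions are hash strings like `2-abc…`, not integers):

```python
# Toy CouchDB-style MVCC store: updates must carry the revision they read.

class ConflictError(Exception):
    """Raised when an update presents a stale revision (CouchDB's 409)."""

class TinyCouch:
    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return {"_id": doc_id, "_rev": rev, **body}

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        if current is not None and current[0] != rev:
            raise ConflictError(f"stale revision {rev!r}")
        new_rev = 1 if current is None else current[0] + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

The same rev-comparison rule is what lets two replicating CouchDB masters detect (rather than lose) conflicting edits.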

    Redis (V2.4)

    ·         Written in: C/C++

    ·         Main point: Blazing fast

    ·         License: BSD

    ·         Protocol: Telnet-like

    ·         Disk-backed in-memory database,

    ·         Currently without disk-swap (VM and Diskstore were abandoned)

    ·         Master-slave replication

    ·         Simple values or hash tables by keys,

    ·         but complex operations like ZREVRANGEBYSCORE.

    ·         INCR & co (good for rate limiting or statistics)

    ·         Has sets (also union/diff/inter)

    ·         Has lists (also a queue; blocking pop)

    ·         Has hashes (objects of multiple fields)

    ·         Sorted sets (high score table, good for range queries)

    ·         Redis has transactions (!)

    ·         Values can be set to expire (as in a cache)

    ·         Pub/Sub lets one implement messaging (!)

    Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).

    For example: Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.
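The "INCR & co (good for rate limiting)" bullet refers to a classic Redis idiom: INCR a per-client counter and let the key EXPIRE at the end of the window. The sketch below simulates it on a plain dict so it runs without a Redis server; with a real client the two operations would be `r.incr(key)` and `r.expire(key, window)`:

```python
# Fixed-window rate limiter, simulating the Redis INCR + EXPIRE idiom
# in process memory (no server needed).
import time

class RateLimiter:
    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._buckets = {}  # key -> (count, window_start)

    def allow(self, key):
        now = self.clock()
        count, start = self._buckets.get(key, (0, now))
        if now - start >= self.window:  # key "expired": start a new window
            count, start = 0, now
        count += 1                       # the INCR step
        self._buckets[key] = (count, start)
        return count <= self.limit
```

Because INCR is atomic in Redis, many application servers can share one counter per client without extra locking, which is why this pattern is also handy for live statistics.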

    Clones of Google's Bigtable

    HBase (V0.92.0)

    ·         Written in: Java

    ·         Main point: Billions of rows X millions of columns

    ·         License: Apache

    ·         Protocol: HTTP/REST (also Thrift)

    ·         Modeled after Google's BigTable

    ·         Uses Hadoop's HDFS as storage

    ·         Map/reduce with Hadoop

    ·         Query predicate push down via server side scan and get filters

    ·         Optimizations for real time queries

    ·         A high performance Thrift gateway

    ·         HTTP supports XML, Protobuf, and binary

    ·         Jruby-based (JIRB) shell

    ·         Rolling restart for configuration changes and minor upgrades

    ·         Random access performance is like MySQL

    ·         A cluster consists of several different types of nodes

    Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.

    For example: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

    Cassandra (1.2)

    ·         Written in: Java

    ·         Main point: Best of BigTable and Dynamo

    ·         License: Apache

    ·         Protocol: Thrift & custom binary CQL3

    ·         Tunable trade-offs for distribution and replication (N, R, W)

    ·         Querying by column, range of keys (Requires indices on anything that you want to search on)

    ·         BigTable-like features: columns, column families

    ·         Can be used as a distributed hash-table, with an "SQL-like" language, CQL (but no JOIN!)

    ·         Data can have expiration (set on INSERT)

    ·         Writes can be much faster than reads (when reads are disk-bound)

    ·         Map/reduce possible with Apache Hadoop

    ·         All nodes are similar, as opposed to Hadoop/HBase

    ·         Very good and reliable cross-datacenter replication

    Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

    For example: Banking and the financial industry (though not necessarily for financial transactions; these industries are much bigger than that). Writes are faster than reads, so one natural niche is data analysis.
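The "data can have expiration (set on INSERT)" bullet corresponds to CQL's `USING TTL` clause. As a rough illustration of the semantics (not Cassandra's implementation, which uses per-cell deadlines and tombstones), each write carries its own expiry and reads treat dead cells as absent:

```python
# Toy key-value store with per-write TTL, mimicking the observable behaviour
# of Cassandra's INSERT ... USING TTL.
import time

class TTLStore:
    def __init__(self, clock=time.time):
        self._cells = {}  # key -> (value, expires_at or None)
        self._clock = clock

    def insert(self, key, value, ttl=None):
        expires = self._clock() + ttl if ttl is not None else None
        self._cells[key] = (value, expires)

    def get(self, key):
        value, expires = self._cells.get(key, (None, None))
        if expires is not None and self._clock() >= expires:
            return None  # expired cell reads as missing
        return value
```

Expiring at write time (rather than by a background sweeper the application must manage) is what makes this convenient for logging-style workloads where old rows simply age out.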

    Hypertable (0.9.6.5)

    ·         Written in: C++

    ·         Main point: A faster, smaller HBase

    ·         License: GPL 2.0

    ·         Protocol: Thrift, C++ library, or HQL shell

    ·         Implements Google's BigTable design

    ·         Run on Hadoop's HDFS

    ·         Uses its own, "SQL-like" language, HQL

    ·         Can search by key, by cell, or for values in column families.

    ·         Search can be limited to key/column ranges.

    ·         Sponsored by Baidu

    ·         Retains the last N historical values

    ·         Tables are in namespaces

    ·         Map/reduce with Hadoop

    Best used: If you need a better HBase.

    For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

    Accumulo (1.4)

    ·         Written in: Java and C++

    ·         Main point: A BigTable with Cell-level security

    ·         License: Apache

    ·         Protocol: Thrift

    ·         Another BigTable clone, also runs on top of Hadoop

    ·         Cell-level security

    ·         Bigger rows than memory are allowed

    ·         Keeps a memory map outside Java, in C++ STL

    ·         Map/reduce using Hadoop's facilities (ZooKeeper & co)

    ·         Some server-side programming

    Best used: If you need a different HBase.

    For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

    Special-purpose

    Neo4j (V1.5M02)

    ·         Written in: Java

    ·         Main point: Graph database - connected data

    ·         License: GPL, some features AGPL/commercial

    ·         Protocol: HTTP/REST (or embedding in Java)

    ·         Standalone, or embeddable into Java applications

    ·         Full ACID conformity (including durable data)

    ·         Both nodes and relationships can have metadata

    ·         Integrated pattern-matching-based query language ("Cypher")

    ·         Also the "Gremlin" graph traversal language can be used

    ·         Indexing of nodes and relationships

    ·         Nice self-contained web admin

    ·         Advanced path-finding with multiple algorithms

    ·         Indexing of keys and relationships

    ·         Optimized for reads

    ·         Has transactions (in the Java API)

    ·         Scriptable in Groovy

    ·         Online backup, advanced monitoring and High Availability are AGPL/commercial licensed

    Best used: For graph-style, rich or complex, interconnected data. Neo4j is quite different from the others in this sense.

    For example: For searching routes in social relations, public transport links, road maps, or network topologies.
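The route-searching use-case above boils down to path finding over connected data. A minimal sketch, using breadth-first search over an adjacency list (the simplest of the shortest-path algorithms a graph database like Neo4j offers; the tiny "transport network" data is made up for illustration):

```python
# Breadth-first search for one shortest path in an unweighted graph.
from collections import deque

def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# A made-up public-transport network: station -> directly reachable stations.
lines = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
```

In Neo4j the same question would be one Cypher pattern match or a call to its built-in shortest-path algorithms; the point is that relationships are first-class, so such traversals need no JOINs.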

    ElasticSearch (0.20.1)

    ·         Written in: Java

    ·         Main point: Advanced Search

    ·         License: Apache

    ·         Protocol: JSON over HTTP (Plugins: Thrift, memcached)

    ·         Stores JSON documents

    ·         Has versioning

    ·         Parent and children documents

    ·         Documents can time out

    ·         Very versatile and sophisticated querying, scriptable

    ·         Write consistency: one, quorum or all

    ·         Sorting by score (!)

    ·         Geo distance sorting

    ·         Fuzzy searches (approximate date, etc) (!)

    ·         Asynchronous replication

    ·         Atomic, scripted updates (good for counters, etc)

    ·         Can maintain automatic "stats groups" (good for debugging)

    ·         Still depends very much on only one developer (kimchy).

    Best used: When you have objects with (flexible) fields, and you need "advanced search" functionality.

    For example: A dating service that handles age difference, geographic location, tastes and dislikes, etc. Or a leaderboard system that depends on many variables.
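The "fuzzy searches" bullet above is, at its core, ranking candidates by edit distance. Elasticsearch does this inside Lucene with far better data structures (automata over term dictionaries); the sketch below only illustrates the idea, with made-up example data:

```python
# Levenshtein edit distance via classic dynamic programming, plus a toy
# "fuzzy search" that ranks candidates by distance to the query.

def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution (free if equal)
        prev = cur
    return prev[-1]

def fuzzy_search(query, docs, max_edits=2):
    """Return docs within `max_edits` of the query, best matches first."""
    scored = [(edit_distance(query, d), d) for d in docs]
    return [d for dist, d in sorted(scored) if dist <= max_edits]
```

Sorting results by score rather than filtering on exact equality is the same shift in mindset behind the "sorting by score (!)" bullet.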

    The "long tail"
    (Not widely known, but definitely worthy ones)

    Couchbase (ex-Membase) (2.0)

    ·         Written in: Erlang & C

    ·         Main point: Memcache compatible, but with persistence and clustering

    ·         License: Apache

    ·         Protocol: memcached + extensions

    ·         Very fast (200k+/sec) access of data by key

    ·         Persistence to disk

    ·         All nodes are identical (master-master replication)

    ·         Provides memcached-style in-memory caching buckets, too

    ·         Write de-duplication to reduce IO

    ·         Friendly cluster-management web GUI

    ·         Connection proxy for connection pooling and multiplexing (Moxi)

    ·         Incremental map/reduce

    ·         Cross-datacenter replication

    Best used: Any application where low-latency data access, high concurrency support and high availability are requirements.

    For example: Low-latency use-cases like ad targeting or highly-concurrent web apps like online gaming (e.g. Zynga).
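How a memcached-style cluster reaches "200k+/sec access of data by key" across identical nodes is worth a sketch: hash the key into a fixed set of virtual buckets (Couchbase calls them vBuckets, 1024 of them; 64 here for brevity), then look the bucket up in a shared cluster map. The map, not the hash, changes on rebalance. Names and numbers below are illustrative:

```python
# Toy vBucket-style key-to-node mapping, loosely after Couchbase's design.
import zlib

NUM_VBUCKETS = 64  # Couchbase actually uses 1024

def vbucket_for(key):
    """Deterministically hash a key into a vbucket id."""
    return zlib.crc32(key.encode()) % NUM_VBUCKETS

def node_for(key, cluster_map):
    """cluster_map: list where index = vbucket id, value = owning node."""
    return cluster_map[vbucket_for(key)]

# A made-up 3-node cluster: vbuckets dealt out round-robin.
cluster_map = [f"node{i % 3}" for i in range(NUM_VBUCKETS)]
```

Because clients cache the cluster map, every get/set is a single hash plus one network hop to the right node, with no coordinator in the data path.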

    Scalaris (0.5)

    ·         Written in: Erlang

    ·         Main point: Distributed P2P key-value store

    ·         License: Apache

    ·         Protocol: Proprietary & JSON-RPC

    ·         In-memory (disk when using Tokyo Cabinet as a backend)

    ·         Uses YAWS as a web server

    ·         Has transactions (an adapted Paxos commit)

    ·         Consistent, distributed write operations

    ·         From CAP, values Consistency over Availability (in case of network partitioning, only the bigger partition works)

    Best used: If you like Erlang and wanted to use Mnesia or DETS or ETS, but you need something that is accessible from more languages (and scales much better than ETS or DETS).

    For example: In an Erlang-based system when you want to give access to the DB to Python, Ruby or Java programmers.

    VoltDB (2.8.4.1)

    ·         Written in: Java

    ·         Main point: Fast transactions and rapidly changing data

    ·         License: GPL 3

    ·         Protocol: Proprietary

    ·         In-memory relational database.

    ·         Can export data into Hadoop

    ·         Supports ANSI SQL

    ·         Stored procedures in Java

    ·         Cross-datacenter replication

    Best used: Where you need to act fast on massive amounts of incoming data.

    For example: Point-of-sales data analysis. Factory control systems.

    Kyoto Tycoon (0.9.56)

    ·         Written in: C++

    ·         Main point: A lightweight network DBM

    ·         License: GPL

    ·         Protocol: HTTP (TSV-RPC or REST)

    ·         Based on Kyoto Cabinet, Tokyo Cabinet's successor

    ·         Multitudes of storage backends: Hash, Tree, Dir, etc (everything from Kyoto Cabinet)

    ·         Kyoto Cabinet can do 1M+ insert/select operations per sec (but Tycoon does less because of overhead)

    ·         Lua on the server side

    ·         Language bindings for C, Java, Python, Ruby, Perl, Lua, etc

    ·         Uses the "visitor" pattern

    ·         Hot backup, asynchronous replication

    ·         Background snapshots of in-memory databases

    ·         Auto expiration (can be used as a cache server)

    Best used: When you want to choose the backend storage algorithm engine very precisely. When speed is of the essence.

    For example: Caching server. Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.

    Of course, all these systems have many more features than what's listed here. I only wanted to list the key points that I base my decisions on. Also, all of them are developing very fast, so things are bound to change.

    MongoDB vs Redis comparison

    http://taotao1240.blog.51cto.com/731446/755173
    taojin1240's blog

    After some googling I found an English-language comparison:

    English source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis/

    So I turned it into a table and added some observations from my own usage, which became this post.

    MongoDB vs Redis (V2.4), row by row (the third table column, notes, is translated inline):

    ·         Written in: MongoDB: C++ | Redis: C/C++

    ·         Main point: MongoDB retains some friendly properties of SQL (query, index), e.g. show dbs; db.test.find() | Redis: blazing fast

    ·         License: MongoDB: AGPL (drivers: Apache) | Redis: BSD

    ·         Protocol: MongoDB: custom, binary (BSON) | Redis: Telnet-like

    ·         Replication: MongoDB: master/slave replication with auto failover via replica sets | Redis: master-slave replication. Note: MongoDB is usually deployed with replica sets and sharding combined; replica sets cover high availability and reliability, while sharding covers performance and easy scaling.

    ·         Sharding: MongoDB has sharding built in | Redis: none built in

    ·         Scripting: in MongoDB, queries are JavaScript expressions, and arbitrary JavaScript functions can run server-side

    ·         Updates: MongoDB has better update-in-place than CouchDB

    ·         Storage: MongoDB uses memory-mapped files for data storage | Redis is a disk-backed in-memory database, currently (2.4) without disk-swap (the VM and Diskstore features were abandoned)

    ·         MongoDB favors performance over features

    ·         MongoDB journaling (with --journal) is best turned on

    ·         On 32-bit systems, MongoDB limits the total size of database files to ~2.5 GB; 64-bit platforms have no such limit

    ·         An empty MongoDB database takes up about 192 MB

    ·         Big data: MongoDB offers GridFS, a specification for storing large files plus metadata inside MongoDB (not actually a file system)

    ·         Expiration: Redis values can be set to expire, as in a cache; e.g. expire name 10 makes the value under name expire in 10 seconds

    ·         Data types: Redis stores simple values or hash tables by key, but supports complex operations like ZREVRANGEBYSCORE on sorted sets, plus INCR & co (good for rate limiting or statistics); it has sets (with union/diff/inter), lists (also usable as queues, with blocking pop), hashes (objects of multiple fields) and sorted sets (high-score tables, good for range queries)

    ·         Transactions: Redis has transactions (!)

    ·         Messaging: Redis Pub/Sub lets one implement messaging (!), which is why Redis is used at Sina Weibo

    Best used: MongoDB for dynamic queries, when defining indexes fits better than map/reduce, and when you need good performance on a big database whose data changes a lot (i.e. you wanted CouchDB's features, but your data changes too much) | Redis for rapidly changing data whose total size is foreseeable, with high memory demands

    For example: MongoDB for most things you would do with MySQL or PostgreSQL when predefined columns hold you back | Redis for stock prices, statistics and analytics, real-time data collection and real-time communication




    A comparison of redis, memcache and mongoDB (compiled)

    (PHPer.yang www.imop.us)
    The comparison of redis, memcache and mongoDB below covers the following dimensions.

    1. Performance
    All three perform well; performance should not be the bottleneck for us.
    Overall, redis and memcache have similar TPS, both higher than mongodb.

    2. Convenience of operations
    memcache has a single data structure (key-value).
    redis is richer: for data operations redis is better, needs fewer network IO round trips, and also stores lists, sets, hashes and other structures.
    mongodb supports rich data expression and indexes; it is the most similar to a relational database, with a very rich query language.

    3. Memory footprint and data volume
    redis added its own VM feature after version 2.0, breaking the physical-memory limit; keys and values can be given expiration times (similar to memcache).

    memcache's maximum usable memory can be adjusted, and it evicts with an LRU algorithm. With the memcached proxy magent you can, for example, build a cluster of ten 4 GB memcache nodes, effectively getting 40 GB:
    magent -s 10.1.2.1 -s 10.1.2.2:11211 -b 10.1.2.3:14000

    mongoDB suits large data volumes; it relies on the operating system's VM for memory management and is quite memory-hungry, so don't run its service alongside others.

    4. Availability (single points of failure)
    On single points of failure:
    redis relies on the client for distributed reads and writes; during master-slave replication, every time the slave reconnects to the master it must pull a full snapshot (there is no incremental replication), which hurts performance and efficiency, so the single-point problem is complicated; it has no automatic sharding, requiring the application to define a consistent-hash scheme.
    One alternative is to skip redis's own replication and do active replication yourself (storing multiple copies), or switch to incremental replication (which you must implement yourself), trading consistency against performance.

    memcache itself has no data-redundancy mechanism, nor does it need one; for failure prevention, rely on mature hashing or ring-based algorithms to absorb the churn caused by a single node's failure.

    mongoDB supports master-slave, replica sets (which use a Paxos election algorithm internally and recover from failures automatically) and auto sharding, hiding failover and partitioning from the client.

    5. Reliability (persistence)
    For data persistence and recovery:
    redis supports both snapshots and AOF: it relies on snapshots for persistence, and AOF improves reliability at some cost to performance;
    memcache does not support persistence and is usually used purely as a cache to boost performance;
    mongoDB has supported durable persistence via a binlog-style journal since version 1.8.

    6. Data consistency (transaction support)
    memcache guarantees consistency under concurrency with CAS;
    redis's transaction support is weak, only guaranteeing that the operations in a transaction execute consecutively;
    mongoDB does not support transactions.

    7. Data analysis
    mongoDB has built-in data-analysis functionality (mapreduce); the other two do not.

    8. Application scenarios
    redis: smaller data volumes with performance-sensitive operations and computation;
    memcache: reducing database load and boosting performance in dynamic systems; good as a cache (read-heavy workloads; shard when the data gets big);
    mongoDB: solving access-efficiency problems for massive data sets.
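The consistent-hash scheme mentioned under availability above can be sketched: keys and nodes hash onto the same ring, each key goes to the next node clockwise, so removing one node only remaps that node's keys instead of reshuffling everything. This is an illustrative toy, not any particular client library's implementation:

```python
# Toy consistent-hash ring with virtual nodes, as used to spread keys over
# a memcache/redis cluster while minimizing churn when nodes come and go.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):  # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's point to the next node point."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The payoff: with naive `hash(key) % n` every resize remaps almost all keys, while on the ring only the departed node's share moves, which is exactly the "absorb the churn from a single node's failure" property the text asks of memcache deployments.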

    Table comparison (memcache vs redis):

    ·         Type: memcache: in-memory database | redis: in-memory database

    ·         Data types: memcache: the data type must be fixed when the value is defined | redis: no such requirement; has strings, lists, sets and sorted sets

    ·         Virtual memory: memcache: not supported | redis: supported

    ·         Expiration policy: memcache: supported | redis: supported

    ·         Distribution: memcache: via magent | redis: master-slave, one master with one or many slaves

    ·         Data safety: memcache: not supported | redis: save writes a dump.rdb to disk

    ·         Disaster recovery: memcache: not supported | redis: append-only file (AOF) for data recovery

    Notes on the rows:
    1. Type: memcache and redis both keep data in memory, so both are in-memory databases. Of course, memcache can also cache other things, such as images.
    2. Data types: memcache requires the byte length of the data to be specified when adding it, while redis does not.
    3. Virtual memory: when physical memory runs out, values unused for a long time can be swapped to disk.
    4. Expiration policy: memcache sets it at write time, e.g. set key1 0 0 8 never expires; redis sets it via expire, e.g. expire name 10.
    5. Distribution: build a memcache cluster and use magent for one master with many slaves; redis can also do one master with many slaves. Both can do one master, one slave.
    6. Data safety: when memcache loses power, the data is gone; redis can periodically save to disk.
    7. Disaster recovery: same for memcache; lost redis data can be recovered via AOF.

    posted on 2014-01-14 01:34 by crazycy, filed under: JavaEE technology, DBMS
