
MongoDB vs. Cassandra [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations. Closed 5 years ago.

I am evaluating what might be the best migration option.

Currently, I am on a sharded MySQL (horizontal partition), with most of my data stored in JSON blobs. I do not have any complex SQL queries (I already migrated away from them when I partitioned my database).

Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:

Lots of reads in every query, less regular writes

Not worried about "massive" scalability

More concerned about simple setup, maintenance and code

Minimize hardware/server cost

Official performance benchmark statistics are available: Cassandra vs MongoDB vs HBase
>Lots of reads in every query, less regular writes => Look into CQRS (separate your reads from your writes, probably without event sourcing, but check whether you can update your read model asynchronously; synchronous may work too, depending on your use cases)
This is a great question actually. I wonder if there is an updated version of it? This one is very old now

Esteban Verbel

Lots of reads in every query, fewer regular writes

Both databases perform well on reads where the hot data set fits in memory. Both also emphasize join-less data models (and encourage denormalization instead), and both provide indexes on documents or rows, although MongoDB's indexes are currently more flexible.

Cassandra's storage engine provides constant-time writes no matter how big your data set grows. Writes are more problematic in MongoDB, partly because of the B-tree-based storage engine, but more because of the multi-granularity locking it does.

For analytics, MongoDB provides a custom map/reduce implementation; Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL). Cassandra also supports use of Spark.

Not worried about "massive" scalability

If you're looking at a single server, MongoDB is probably a better fit. For those more concerned about scaling, Cassandra's no-single-point-of-failure architecture will be easier to set up and more reliable. (MongoDB's global write lock tends to become more painful, too.) Cassandra also gives a lot more control over how your replication works, including support for multiple data centers.

More concerned about simple setup, maintenance and code

Both are trivial to set up, with reasonable out-of-the-box defaults for a single server. Cassandra is simpler to set up in a multi-server configuration since there are no special-role nodes to worry about.

If you're presently using JSON blobs, MongoDB is an insanely good match for your use case, given that it uses BSON to store the data. You'll be able to have richer and more queryable data than you would in your present database. This would be the most significant win for Mongo.


Totally different, a comment isn't big enough, but ... Cassandra is a linearly scalable (amortized constant time reads & writes) Dynamo/Google Bigtable hybrid that features fast writes regardless of data size. Its feature set is minimalistic, little beyond that of an ordered key-value store. MongoDB is a heavily featured (and fast) document store at the cost of durability and guarantees about writes persisting (since they're not immediately written to disk). They're different beasts with different philosophies, MongoDB's closer to an RDBMS replacement ...
while Cassandra is lower level but allows for uber scaling (see Twitter/Digg/Facebook), but you're going to have to be deliberate in how you lay your data out, build secondary indexes etc., since no flexible querying is allowed.
Because everyone mentioned Twitter here in relation to Cassandra: they are not using Cassandra for persisting tweets; they still use MySQL for that (engineering.twitter.com/2010/07/cassandra-at-twitter-today.html). OK, but I can imagine that they still store lots of data for other purposes in Cassandra.
It looks like the global write lock may have been removed in Mongo 2.2...
Even before my project went live, I am feeling the pain points of MongoDB. Hot backup is a basic requirement. To do a hot backup on a Linux server, you first have to set up an LVM partition (not so common) and take a snapshot before every backup session. Another easy way is to use MongoDB's paid backup service, but that service is expensive ($2.3/GB/month). Soon you will need a replica set for fault tolerance. With the open source version, the nodes can exchange data only as clear text. For SSL you have to go with the Enterprise edition, and that is $10,000. Goodbye MongoDB. Refactoring my code to Cassandra.
Richard K.

I've used MongoDB extensively (for the past 6 months), building a hierarchical data management system, and I can vouch for both the ease of setup (install it, run it, use it!) and the speed. As long as you think about indexes carefully, it can absolutely scream along, speed-wise.

I gather that Cassandra, due to its use with large-scale projects like Twitter, has better scaling functionality, although the MongoDB team is working on parity there. I should point out that I've not used Cassandra beyond the trial-run stage, so I can't speak for the detail.

The real swinger for me, when we were assessing NoSQL databases, was the querying - Cassandra is basically just a giant key/value store, and querying is a bit fiddly (at least compared to MongoDB), so for performance you'd have to duplicate quite a lot of data as a sort of manual index. MongoDB, on the other hand, uses a "query by example" model.

For example, say you've got a Collection (MongoDB parlance for the equivalent of an RDBMS table) containing Users. MongoDB stores records as Documents, which are basically binary JSON objects, e.g.:

{
   FirstName: "John",
   LastName: "Smith",
   Email: "john@smith.com",
   Groups: ["Admin", "User", "SuperUser"]
}

If you wanted to find all of the users called Smith who have Admin rights, you'd just create a new document (at the admin console using JavaScript, or in production using the language of your choice):

{
   LastName: "Smith",
   Groups: "Admin"
}

...and then run the query. That's it. There are added operators for comparisons, RegEx filtering etc, but it's all pretty simple, and the Wiki-based documentation is pretty good.
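
To make the matching rule concrete, here is a minimal pure-Python sketch of how a "query by example" document can be checked against stored documents. It is a simplified illustration, not MongoDB's implementation: the helper names (`field_matches`, `matches`, `find`) and the two extra users (Jane and Alan) are made up for this example, and only plain equality is covered, including the rule that a single value matches an array field when the array contains it.

```python
def field_matches(stored, wanted):
    """Equality match; a stored list also matches a single value it contains."""
    if isinstance(stored, list) and not isinstance(wanted, list):
        return wanted in stored
    return stored == wanted


def matches(document, example):
    """True if every field of the example document matches the stored document."""
    return all(
        key in document and field_matches(document[key], value)
        for key, value in example.items()
    )


def find(collection, example):
    """Return every document in the collection that matches the example."""
    return [doc for doc in collection if matches(doc, example)]


# The first user comes from the text above; the other two are hypothetical.
users = [
    {"FirstName": "John", "LastName": "Smith", "Email": "john@smith.com",
     "Groups": ["Admin", "User", "SuperUser"]},
    {"FirstName": "Jane", "LastName": "Smith", "Email": "jane@example.com",
     "Groups": ["User"]},
    {"FirstName": "Alan", "LastName": "Jones", "Email": "alan@example.com",
     "Groups": ["Admin"]},
]

smith_admins = find(users, {"LastName": "Smith", "Groups": "Admin"})
print([user["FirstName"] for user in smith_admins])  # ['John']
```

Only John is returned: Jane is a Smith without Admin rights, and Alan is an Admin who is not a Smith. Against a real server, the same example document would be passed to the driver's find call instead of this local helper.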


Update (8th August 2011): Amazon's Ireland EC2 data centre had a lightning-related incident last night, and in sorting out our server recovery, I discovered one pretty crucial point: if you've got a replication set of two servers (and they're easy to setup), make sure you have an Arbiter node, so if one goes down, the other one doesn't panic and stall in Secondary mode! Trust me, that's a pain in the behind to sort out with a big database.
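
The reason the arbiter matters comes down to vote counting: a replica set can only elect (or keep) a primary while a strict majority of its voting members can see each other. The arithmetic can be sketched as follows; the function name `has_majority` is made up for this illustration and is not part of any MongoDB API.

```python
def has_majority(voting_members: int, reachable: int) -> bool:
    """True if the reachable members form a strict majority of all voters."""
    return reachable > voting_members // 2


# Two data-bearing nodes, no arbiter: losing one leaves 1 of 2 votes.
print(has_majority(voting_members=2, reachable=1))  # False

# Two data-bearing nodes plus an arbiter: losing one data node leaves 2 of 3.
print(has_majority(voting_members=3, reachable=2))  # True
```

With two voters, the surviving node cannot reach a majority on its own, so it stays Secondary. Adding an arbiter gives the survivor a second vote without storing a third copy of the data.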
To add to what @Richard K said: you should have an arbiter node when you have an even number of nodes (primary + secondary) in a replica set.
In addition, consider MongoDB when you need to do more aggregation for data analytics.
"As long as you think about indexes carefully, it can absolutely scream along, speed-wise." Wait until your physical memory gets full and the OS starts page faulting lol
Jason Grant Taylor

Why choose between a traditional database and a NoSQL data store? Use both! The problem with NoSQL solutions (beyond the initial learning curve) is the lack of transactions. So do all updates in MySQL and have MySQL populate a NoSQL data store for reads; you then benefit from each technology's strengths. This does add more complexity, but you already have the MySQL side -- just add MongoDB, Cassandra, etc. to the mix.
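
A rough sketch of that split, under loud assumptions: an in-memory SQLite database stands in for MySQL, a plain dict stands in for the NoSQL read store, and the `accounts` table and helper names are invented for the example. The point is only the shape of the flow: writes go through a transaction on the relational side, and the read store is refreshed from the committed rows afterwards.

```python
import json
import sqlite3

# Stand-ins (assumptions): SQLite plays MySQL, a dict plays the NoSQL store.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
)
db.execute("INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 50)")
db.commit()

read_store = {}


def refresh_read_model(account_id):
    """Copy the committed row into the read store as a JSON document."""
    row = db.execute(
        "SELECT id, owner, balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    read_store[account_id] = json.dumps(
        {"id": row[0], "owner": row[1], "balance": row[2]}
    )


def transfer(src, dst, amount):
    """All updates go through the relational side, inside one transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        db.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
    # Only after the commit is the read side brought up to date.
    for account_id in (src, dst):
        refresh_read_model(account_id)


transfer(1, 2, 30)
print(json.loads(read_store[1])["balance"], json.loads(read_store[2])["balance"])
```

In a real deployment the refresh step would more likely run asynchronously, which means reads can briefly lag behind writes; whether that is acceptable depends on the use case.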

NoSQL datastores generally scale far better than a traditional DB on otherwise identical specs -- there is a reason why Facebook, Twitter, Google, and most start-ups are using NoSQL solutions. It's not just geeks getting high on new tech.


I totally agree. I am using MongoDB + MySQL in an upcoming product that I am architecting, a financial product cloud. MySQL is used where we absolutely need transactional capabilities. MongoDB is used to store non-computing complex data structures that just need to be pulled up when required. Working well so far. :)
I have also used such a dual approach in most of my projects, and in some others an NFS-mounted file system was used together with PostgreSQL for seismic blobs nearing 1 Gb in some cases. A path is a kind of query to the key-value database.
Here is a link to a question I asked about how to architect both SQL and NoSQL databases: dba.stackexchange.com/questions/102053/… I could use any insight you may have.
He already has escaped from transactions for good => now infinite scalability might be possible .. otherwise -> not :)
This is not a good solution if your data is distributed
Kostja

I'm probably going to be the odd man out, but I think you need to stay with MySQL. You haven't described a real problem you need to solve, and MySQL/InnoDB is an excellent storage back-end even for blob/JSON data.

There is a common tendency among Web engineers to reach for NoSQL as soon as they realize that not all features of an RDBMS are being used. This alone is not a good reason, since NoSQL databases most often have rather poor data engines (what MySQL calls a storage engine).

Now, if you're not of that kind, then please specify what is missing in MySQL that you're looking for in a different database (such as auto-sharding, automatic failover, multi-master replication, or a weaker data consistency guarantee in a cluster paying off in higher write throughput).


He is using sharding, which means his data is partitioned manually across servers. MongoDB can automate sharding, which may be a benefit.
He is also storing mostly JSON blobs in an RDBMS -- rendering relational design (features) useless.
The data model and automatic sharding are indeed different, but when choosing a database, you need to look at the storage engine first, and the rest of the bells and whistles second. How is the storage engine going to perform under a load spike? How is the autosharding feature going to perform under a data inflow spike? Before you relinquish control of these important aspects to the database, you'd better make sure it's going to be capable of the task.
The relational model is one of the most well thought-out, efficient to implement and frugal data models out there. "Rendering relational design features useless" may relate to constraints, triggers, or referential integrity, but these are all pay-per-use.
user2066657

I haven't used Cassandra, but I have used MongoDB and think it's awesome.

If you're after simple setup, this is it: You simply untar MongoDB and run the mongod daemon and that's it ... it's running.

Obviously that's only a starter, but to get you started it's easy.


AFAIK, the same applies to Cassandra as well. Untar, run the daemon. The test cluster is set up and ready for production!
GrayWizardx

I saw a presentation on MongoDB yesterday. I can definitely say that setup was "simple", as simple as unpacking it and firing it up. Done.

I believe that both MongoDB and Cassandra will run on virtually any regular Linux hardware, so you should not find too much of a barrier in that area.

I think in this case, at the end of the day, it will come down to which one you personally feel more comfortable with and which has a toolset that you prefer. As far as the presentation on MongoDB goes, the presenter indicated that the toolset for MongoDB was pretty light and that there weren't many (they said any, really) tools similar to what's available for MySQL. This was of course their experience, so YMMV. One thing that I did like about MongoDB was that there seemed to be lots of language support for it (Python and .NET being the two that I primarily use).

The list of sites using MongoDB is pretty impressive, and I know that Twitter just switched to using Cassandra.


At the end of the day it is an apples vs oranges comparison. Both databases have their own strengths. Here are some things to consider: object model, secondary indexes, write scalability, high availability, etc. I have a blog post that explains the high-level strategic differences between MongoDB and Cassandra here: scalegrid.io/blog/cassandra-vs-mongodb