
Why are document stores like Lucene / Solr not included in NoSQL conversations?

All of us have come across the recent hype around no-SQL solutions. MongoDB, CouchDB, BigTable, Cassandra, and others have been listed as no-SQL options. Here's an example:

http://architects.dzone.com/articles/what-nosql-store-should-i-use

However, three years ago a co-worker and I were using Lucene.NET in a way that seems to fit the description of no-SQL. We did not use it just for user-inputted search queries; we used it to make some reindexed RDBMS table data extremely performant. We implemented our own .NET sort-of-equivalent-to-Solr service to manage these indexes and make them callable. When I left the company, the team switched to Solr itself. (For those not in the know, Solr is a web service that wraps Lucene with REST-callable queries and index dumps.)

What I don't understand is, why is Solr not counted in the typical lists of no-SQL solution options? Am I missing something here? I assume that there are technical reasons why Solr is not comparable to the likes of CouchDB, etc., and in fact I understand that CouchDB uses Lucene as its data store (yes?), but what disqualifies Solr?

I'm not asking as some kind of Solr fanboy or anything, I just don't understand why Solr and the like don't fit the definition of no-SQL, and if Solr technically does fit the definition then what about it likely makes people pooh-pooh it? I'm asking because I'm having difficulty determining whether I should continue using Lucene-based solutions (like Solr) for solutions that I build or if I should really do more research with these other options.


Bill Karwin

I once listened to an interview with author Ursula K. Le Guin about fiction writing. The interviewer asked her about authors who work in different genres of writing. What makes one author a romance writer, and another a mystery writer, and another a science fiction writer? Le Guin responded by explaining:

Genre is about marketing, not about content.

It was an eye-opening statement.

I think the same applies to technology solutions. The NoSQL movement is attracting attention because it's full of marketing energy right now. NoSQL data stores like Hadoop, CouchDB, and MongoDB have commercial ventures backing them, pushing their solutions as new and innovative and exciting so they can grow their business. The term "NoSQL" is a marketing brand that helps them to explain their value.

You're right that Lucene/Solr is technically very similar to a NoSQL document store: it's a denormalized bag of documents (their term) with fields that aren't necessarily consistent across the collection of documents. It's indexed in a sophisticated way to allow you to search across all fields or by specific fields.
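To make the "bag of documents" idea concrete, here is a minimal sketch of a toy inverted index over documents whose fields differ from one document to the next. It is an illustration only, not Lucene's or Solr's actual API; the `TinyIndex` class, its methods, and the sample documents are all made up for this example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index over schemaless documents (hypothetical, not Lucene's API)."""

    def __init__(self):
        # Maps (field, term) -> set of ids of documents containing that term in that field.
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, doc):
        """Store the document and index every whitespace-separated term of every field."""
        self.docs[doc_id] = doc
        for field, value in doc.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, term, field=None):
        """Return sorted ids matching the term in one field, or in any field if field is None."""
        term = term.lower()
        if field is not None:
            return sorted(self.postings.get((field, term), set()))
        hits = set()
        for (_, indexed_term), ids in self.postings.items():
            if indexed_term == term:
                hits |= ids
        return sorted(hits)


index = TinyIndex()
# The documents deliberately do not share the same set of fields.
index.add(1, {"title": "Red car", "tag": "car"})
index.add(2, {"title": "Blue bike", "author": "Ann"})
index.add(3, {"body": "A car review", "price": 100})

all_fields = index.search("car")                 # matches in any field
title_only = index.search("car", field="title")  # matches in the title field only
```

A real engine adds analysis, scoring, and on-disk segment storage on top of this idea; the sketch only shows why inconsistent fields are not a problem for this kind of index.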

But that's not the genre Lucene uses to explain its value. They don't have the same mission to grow a market and a business, since they're managed by the Apache Foundation. They're happy to focus on the use case of full-text search, even though the technology could be used in other ways. They're following a tenet of software success: do one thing, and do it well.


Good thoughts, thumbs-up. But CouchDB is an Apache project like Solr is, and Solr is used in a lot of commercial scenarios such as CNET. So with your logic regarding commercial ventures vs. Apache, other than the up-front messaging (i.e. "faceted searching" rather than "indexed column/value stores") I still don't see why Solr isn't treated the same in the no-SQL space.
CouchDB is supported by commercial enterprises Couchio and Cloudant. Damien Katz is the primary architect of CouchDB and he's the founder and CEO of Couchio. He just happens to grant his code to the Apache Foundation.
RavenDB uses Lucene extensively, IIRC
Jon Davis

After doing more Google-searching, I think this document sums it up pretty well:

https://web.archive.org/web/20100504055638/http://www.lucidimagination.com/blog/2010/04/30/nosql-lucene-and-solr/

In short, Lucene/Solr is NoSQL and could be considered one of NoSQL's more mature "forefathers". It just does not get the NoSQL hype it deserves because it didn't invent the term "no-SQL" and its users don't use the term, so the hype machine overlooked it.


Check out MUMPS for a real NoSQL forefather! en.wikipedia.org/wiki/MUMPS
As NoSQL is usually interpreted as "Not Only SQL", MUMPS came in an era where it could not complement SQL at all. However, kudos for the reference and "nostolgia" (although this is way before my time).
Here's another document-oriented database that dates back to 1989: en.wikipedia.org/wiki/Lotus_Notes#Database I'm sure it's no coincidence that Damien Katz also worked on Lotus Notes at IBM.
Berkeley DB and ESENT could also be considered NoSQL.
The link above is not working any more; the current link is probably lucidworks.com/blog/nosql-lucene-and-solr
Jokin

I think the most relevant characteristic that drops Solr/Lucene from the NoSQL list is that, until recently, making Lucene work as a real-time system was a pain. The usual workflow for any performant application was to index the incremental updates in batches, updating the index every 5 minutes, for example.
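The batch workflow described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not Lucene or Solr code: the `BatchingIndexer` class, the `commit_fn` callback, and the injectable clock are hypothetical names chosen for this example.

```python
import time


class BatchingIndexer:
    """Buffers updates and hands them to commit_fn at most once per interval (hypothetical)."""

    def __init__(self, commit_fn, interval_seconds=300, clock=time.monotonic):
        self.commit_fn = commit_fn        # called with the list of pending documents
        self.interval = interval_seconds  # 300 s matches the "every 5 minutes" example
        self.clock = clock                # injectable so the demo does not have to sleep
        self.pending = []
        self.last_commit = clock()

    def add(self, doc):
        """Queue a document; it only becomes searchable after the next commit."""
        self.pending.append(doc)
        self.maybe_commit()

    def maybe_commit(self):
        """Commit the whole batch if the interval has elapsed since the last commit."""
        if self.pending and self.clock() - self.last_commit >= self.interval:
            self.commit_fn(self.pending)
            self.pending = []
            self.last_commit = self.clock()


# Demo with a fake clock so the interval can be crossed instantly.
now = [0.0]
committed = []
indexer = BatchingIndexer(committed.append, interval_seconds=300, clock=lambda: now[0])

indexer.add({"id": 1})
indexer.add({"id": 2})   # still buffered, nothing committed yet
now[0] = 301.0           # pretend just over five minutes have passed
indexer.add({"id": 3})   # this add triggers one commit of all three documents
```

The cost of this design is exactly the one the answer points at: a document added right after a commit stays invisible to searches until the next interval elapses.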


Community

I think that stimpy77 is partly right on the NoSQL being a branding thing. But also, NoSQL means that it's a data storage platform that is simpler/easier than SQL-based solutions. And I think while Solr/Lucene share some aspects (they store data), it really misses the mark to think that Solr/Lucene could be used as primary data storage for anything that has relationships. Sure, lots of documents can be thrown into it, and a powerful search can pull them back. But as soon as you want relationships, stores such as CouchDB that have a query syntax of some kind do much better. Search is a band-aid solution in that case. Think about the use case "find all documents tagged with word 'car'". If I have some structure in my data, then it's easy for me to get the document for the tag car and pull everything back, versus relying on a search query that includes fq=tag:'car'. Search is more and more powerful the fewer relationships you have, but the more relationships you have, the better a datastore like CouchDB and its brethren are. That's why you still see CouchDB and friends paired with Solr, and vice versa! Let each one do what it does best.
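For reference, a filter query like the one mentioned above could be assembled as in the sketch below. The host, port, and core name (`mycore`) are assumptions made up for this example, and so is the `build_tag_query` helper; `q`, `fq`, and `wt` are standard Solr request parameters on the `/select` handler. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_tag_query(base_url, tag):
    """Build a Solr /select URL that matches all documents, filtered to one tag value."""
    params = {
        "q": "*:*",          # match every document
        "fq": f"tag:{tag}",  # filter query restricting results to the given tag
        "wt": "json",        # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core; adjust to a real deployment.
url = build_tag_query("http://localhost:8983/solr/mycore", "car")
```

`urlencode` percent-encodes the special characters in `*:*` and `tag:car`, so the resulting URL is safe to pass to any HTTP client.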

Of course, that isn't to say you can't leverage storing your source data in Solr, that can be a powerful tool to use!


"I think that stimpy77 is partly right on the NoSQL being a branding thing." I think this credit goes to Bill Karwin. Thanks tho.
Regarding your point, several of the opinionated definitions of "NoSQL" are that it specifically deemphasizes relational integrity. Does BigTable support relational data? Does Cassandra? Granted, relational is nice, but it is surely not part of the definition of NoSQL that most people agree on. Solr, on the other hand, does support "faceted searching", which is sort of an abstract approach to many-to-many-to-many-to-many kinds of filtering. Filtering is not relational data, but it can assist in subqueried virtual joins.
"[NoSQL is] a data storage platform that is simpler/easier than SQL based solutions" Uh, not hardly. It's just DIFFERENT. Especially when you get into distributed systems, lack of consistency, and non-ACID storage, "simple" and "easy" are some of the first things you lose.
Gokul Muralidharan

In my opinion, the main operational differences between a NoSQL store and Solr are the following.

1. Solr requires an intermediate data store (a database or XML files), whereas a NoSQL store is itself the data store.
2. You cannot write to Solr constantly (Solr 4.0 seems to bring that support); you can only index at most every 2 minutes and 200 records, which is very slow for high-throughput writes and forces you to use intermediate storage.
3. You are required to change or define the schema when you alter what is stored in a document. NoSQL has no such definitions.
4. Solr's performance suffers as its index size grows, whereas NoSQL is optimized for that (or claims to be :) ).
5. Solr comes with Lucene's search algorithms bundled, whereas in NoSQL you need to build them yourself. This applies to the magnificent faceted search and the blazing-fast document search provided by Solr.


When people mark someone's answer down I wish they could say why. This answer has 5 points and I consider some correct and others not. But I'm no expert, so I would like it confirmed which are right and which are wrong.
Point 1: Solr (Lucene) is a primary data store. There is nothing intermediate. If you want to, you can use it as a system of record. Most people don't because its strength is in search. Point 2: There are people that are constantly indexing and doing commits once a second, or even faster. Where did 2 minutes / 200 records come from? Point 3: True, in this way it is not as flexible as other software. Point 4: Because Solr is designed around search, it must have data in RAM for good performance. OS disk cache rules. Low RAM makes it slow. Point 5: Yes. Solr does search, does it well.
Hello @Elyagrag, please find my responses below; I appreciate your input. Point 1: My point was that you need to rebuild the whole document and then submit it for processing, which qualifies as an intermediate build (I agree you can avoid the store). In my experience with Solr, we always had an external data store to help re-index the data in case of a corrupt index. Point 2: For performance reasons, commits are generally delayed. I agree you can commit at much shorter intervals, but that affects performance on high-transaction sites.
Another thing to add to my point 1 is that when you need to add a field (which always happens) you need to reindex, and that is where you would need an intermediate data store. See blog.michaelhamrah.com/2011/11/…
Viswanath Lekshmanan

A few last points. These are about the actual differences that set Solr apart from NoSQL, not the marketing strategy mentioned here.

Lucene/Solr: I'm going to refer to Solr, since Solr uses Lucene internally and has additional features. So Solr is basically an upgrade to Lucene in a new costume.

Solr is mainly used to create facets and to index plain text for search engines.

Solr can use most databases to store its data. Keeping data in Solr itself is inconsistent, since it uses the disk directly.

NoSQL databases are easy to learn compared to Solr. Solr has a lot of configuration and concepts (e.g. fields).

Performance is something we have to consider between the two. Solr provides high performance compared to other NoSQL databases.

Note: Combining Solr with some databases provides the best performance.

Summary: Solr is also a NoSQL datastore, a predecessor of all NoSQL databases, which didn't get the hype of the others. But it is still in the field due to its performance and power.

