Massive Technical Interviews Tips: Couchbase

http://guide.couchdb.org/draft/conflicts.html

You now have a document in each of the databases that has different information. This situation is called a conflict. Conflicts occur in distributed systems. They are a natural state of your data. How does CouchDB’s replication system deal with conflicts?

When you replicate two databases in CouchDB and you have conflicting changes, CouchDB will detect this and will flag the affected document with the special attribute "_conflicts" (a list of the conflicting revision IDs, returned when the document is requested with ?conflicts=true). Next, CouchDB determines which of the changes will be stored as the latest revision (remember, documents in CouchDB are versioned). The version that gets picked to be the latest revision is the winning revision. The losing revision gets stored as the previous revision.

CouchDB does not attempt to merge the conflicting revision. Your application dictates how the merging should be done. The choice of picking the winning revision is arbitrary. In the case of the phone number, there is no way for a computer to decide on the right revision. This is not specific to CouchDB; no other software can do this (ever had your phone’s sync-contacts tool ask you which contact from which source to take?).

Replication guarantees that conflicts are detected and that each instance of CouchDB makes the same choice regarding winners and losers, independent of all the other instances. There is no group decision made; instead, a deterministic algorithm determines the order of the conflicting revisions. After replication, all instances taking part have the same data. The data set is said to be in a consistent state. If you ask any instance for a document, you will get the same answer regardless of which one you ask.

Whether or not CouchDB picked the version that your application needs, you need to go and resolve the conflict, just as you need to resolve a conflict in a version control system like Subversion. Simply create the version that you want to be the latest by picking the latest, the previous, or both (by merging them), and save it as the new latest revision. Done. Replicate again and your resolution will propagate to all other instances of CouchDB. Resolving a conflict on one node could lead to further conflicts, all of which will need to be addressed, but eventually you will end up with a conflict-free database on all nodes.
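As a rough illustration of the resolution step described above, here is a minimal sketch. The helper name resolveConflict, the contact document, its field names, and the "keep both numbers" policy are assumptions made for this example. Marking the losing revision as deleted is a common way to retire it, and the resulting list is the kind of payload that could be sent to CouchDB's _bulk_docs endpoint, but nothing here talks to a server.

```javascript
// Hypothetical helper: build the documents to write back so that a merged
// version becomes the latest revision and the losing revision is retired.
function resolveConflict(winner, loser, merge) {
  // merge() is application logic: pick the winner, the loser, or a blend.
  const merged = merge(winner, loser);
  return [
    // New version, written on top of the current winning revision.
    Object.assign({}, merged, { _id: winner._id, _rev: winner._rev }),
    // Deletion marker for the losing revision, so the conflict goes away.
    { _id: loser._id, _rev: loser._rev, _deleted: true }
  ];
}

// Example data (assumed): two replicas stored different phone numbers.
const winner = {
  _id: "contact-1",
  _rev: "2-de0ea16f8621cbac506d23a0fbbde08a",
  name: "Alice",
  phone: "555-0100"
};
const loser = {
  _id: "contact-1",
  _rev: "2-7c971bb974251ae8541b8fe045964219",
  name: "Alice",
  phone: "555-0199"
};

// Assumed policy for this sketch: keep both numbers.
const keepBoth = (a, b) => ({ name: a.name, phones: [a.phone, b.phone] });

const docs = resolveConflict(winner, loser, keepBoth);
console.log(JSON.stringify(docs, null, 2));
```

The merge policy is passed in as a function because, as the text says, only the application can decide which data is right.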

How Does CouchDB Decide Which Revision to Use?

CouchDB guarantees that each instance that sees the same conflict comes up with the same winning and losing revisions. It does so by running a deterministic algorithm to pick the winner. The application should not rely on the details of this algorithm and must always resolve conflicts. We’ll tell you how it works anyway.

Each revision includes a list of previous revisions. The revision with the longest revision history list becomes the winning revision. If they are the same, the _rev values are compared in ASCII sort order, and the highest wins. So, in our example, 2-de0ea16f8621cbac506d23a0fbbde08a beats 2-7c971bb974251ae8541b8fe045964219.

One advantage of this algorithm is that CouchDB nodes do not have to talk to each other to agree on winning revisions. We already learned that the network is prone to errors and avoiding it for conflict resolution makes CouchDB very robust.
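The selection rule described above can be sketched in a few lines. This is an illustration only, not CouchDB's own code; pickWinner and revisionDepth are hypothetical names, and the sketch assumes that the number before the dash in a _rev value reflects the length of the revision history.

```javascript
// Assumption: the prefix of a _rev value ("2" in "2-de0e...") gives the
// length of that revision's history.
function revisionDepth(rev) {
  return parseInt(rev.split("-")[0], 10);
}

// Pick the winner between two conflicting _rev values.
function pickWinner(revA, revB) {
  const depthA = revisionDepth(revA);
  const depthB = revisionDepth(revB);
  // The revision with the longest history wins.
  if (depthA !== depthB) return depthA > depthB ? revA : revB;
  // Tie: compare the _rev strings in ASCII order; the highest wins.
  return revA > revB ? revA : revB;
}

const first = "2-de0ea16f8621cbac506d23a0fbbde08a";
const second = "2-7c971bb974251ae8541b8fe045964219";
console.log(pickWinner(first, second));
console.log(pickWinner(second, first));
```

Both calls print the same revision because the function depends only on its two inputs, which is why nodes can agree on a winner without talking to each other.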

http://developer.couchbase.com/documentation/server/current/introduction/intro.html
https://www.quora.com/What-is-the-difference-between-couchbase-and-mongodb
Couchbase (not to be confused with CouchDB) and MongoDB are both document-oriented databases. They both have a document as their storage unit.

That is pretty much where the similarities stop.

Couchbase is a combination of CouchDB + Membase. Views are queried over an HTTP (REST) interface, while key-value operations use the memcached protocol. Objects (documents) are stored in buckets.

To query documents in Couchbase, you define a view over the fields of the document you are interested in (the map), and you can then optionally define aggregate functions over the data (the reduce step).

If you are storing customer data and want to query all customers who have not bought any goods in the past three months, you first write a view (the map) that filters these customers. Once this view is published, Couchbase optimizes searches on it, and you can use the view as the source on which you execute queries.

You can create multiple views over your documents and these views are highly optimized by the system and are only reindexed when the underlying document has significant changes.

This makes Couchbase ideal for situations where the _structure_ of your documents changes infrequently and you know in advance what kinds of queries you will be executing, such as dashboards and real-time updates.

It also offers excellent support for offline databases and built-in master-master replication; making it a good candidate for mobile and other occasionally connected devices.

MongoDB has an entirely different approach to the same problem.

It has a concept of SQL-like queries, and databases and collections.

In MongoDB documents live in a collection, and collections are part of a database.

Just like Couchbase, you can store any arbitrarily nested document; and just like Couchbase an automatic key is generated for you.

However, with MongoDB the way you retrieve documents is more like how you write SQL queries; there are operators for most boolean matches, and pattern matching and (with 3.0) full text search as well. You can also define indexes to help speed up your results.

In this respect, MongoDB is easier to get familiar with if you are already comfortable with traditional SQL.

MongoDB also provides the normal replication capabilities through replica sets, in which a single primary accepts writes; it does not offer built-in master-master replication.

MongoDB can most easily replace your traditional relational database needs; as it has the same concepts of keys/tables ("collections") and query parameters - along with the benefit of being schema-free.

Couchbase and MongoDB both provide commercial support for their databases - MongoDB's commercial offering is called MongoDB Enterprise and Couchbase has Enterprise Edition (EE).

One difference you'll immediately find between MongoDB and Couchbase is that MongoDB does not come with a default administration console/GUI; in fact, a GUI and a complete hosted management service are offered as a paid option.

You can install any number of third-party GUIs to quickly browse your documents, but having one by default would have been nice.

Couchbase provides an excellent GUI with their free product.

http://www.slideshare.net/gabriele.lana/couchdb-vs-mongodb-2982288
> use checkout
switched to db checkout
> db.tickets.save({ "_id": 1, "day": 20100123, "checkout": 100 })
> db.tickets.save({ "_id": 2, "day": 20100123, "checkout": 42 })
> db.tickets.save({ "_id": 3, "day": 20100123, "checkout": 215 })
> db.tickets.save({ "_id": 4, "day": 20100123, "checkout": 73 })
> db.tickets.count()
4
> db.tickets.find()
{ "_id" : 1, "day" : 20100123, "checkout" : 100 }
...
> db.tickets.find({ "_id": 1 })

> var map = function() {
... emit(null, this.checkout)
... }
> var reduce = function(key, values) {
... var sum = 0
... for (var index in values) sum += values[index]
... return sum
... }
> sumOfCheckouts = db.tickets.mapReduce(map, reduce)
{
"result" : "tmp.mr.mapreduce_1263717818_4",
"timeMillis" : 8,
"counts" : { "input" : 4, "emit" : 4, "output" : 1 },
"ok" : 1
}
> db.getCollectionNames()
[
"tickets",
"tmp.mr.mapreduce_1263717818_4",
]
> db[sumOfCheckouts.result].find()
{ "_id" : null, "value" : 430 }
> db.tickets.mapReduce(map, reduce, { "out": "sumOfCheckouts" })
> db.getCollectionNames()
[
"sumOfCheckouts",
"tickets",
"tmp.mr.mapreduce_1263717818_4"
]
> db.sumOfCheckouts.find()
{ "_id" : null, "value" : 430 }
> db.sumOfCheckouts.findOne().value
430
Group as a map/reduce alternative
> db.tickets.group({
... "initial": { "sum": 0 },
... "reduce": function(ticket, checkouts) {
...... checkouts.sum += ticket.checkout
...... }
... })
[ { "sum" : 430 } ]
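To make the flow of the transcript above easier to follow outside the mongo shell, here is a plain JavaScript emulation of the same sum. It is a sketch only: runMapReduce is a hypothetical helper, not a MongoDB API, and emit is passed to the map function as an argument instead of being a global as it is in the shell.

```javascript
// Plain-JavaScript emulation of map/reduce, for illustration only.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  // emit(key, value) collects the emitted values under their key.
  const emit = (key, value) => {
    const groupKey = JSON.stringify(key);
    if (!groups.has(groupKey)) groups.set(groupKey, { key: key, values: [] });
    groups.get(groupKey).values.push(value);
  };
  // As in the shell, `this` inside map is the current document.
  for (const doc of docs) map.call(doc, emit);
  // One output row per distinct key, shaped like the shell output.
  return Array.from(groups.values(), (group) => ({
    _id: group.key,
    value: reduce(group.key, group.values)
  }));
}

// Same tickets as in the transcript.
const tickets = [
  { _id: 1, day: 20100123, checkout: 100 },
  { _id: 2, day: 20100123, checkout: 42 },
  { _id: 3, day: 20100123, checkout: 215 },
  { _id: 4, day: 20100123, checkout: 73 }
];

// Emit every checkout under a single null key.
const map = function (emit) {
  emit(null, this.checkout);
};

// Sum all values emitted for a key.
const reduce = function (key, values) {
  let sum = 0;
  for (const value of values) sum += value;
  return sum;
};

const result = runMapReduce(tickets, map, reduce);
console.log(result); // [ { _id: null, value: 430 } ]
```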

> db.tickets.update({ "_id": 1 }, {
... $set: { "products": {
...... "apple": { "quantity": 5, "price": 10 },
...... "kiwi": { "quantity": 2, "price": 25 }
...... }
... },
... $unset: { "checkout": 1 }
... })

> var map = function() {
... var checkout = 0
... for (var name in this.products) {
...... var product = this.products[name]
...... checkout += product.quantity * product.price
...... }
... emit(this.day, checkout)
... }
> var reduce = function(key, values) {
... var sum = 0
... for (var index in values) sum += values[index]
... return sum
... }

> db.tickets.mapReduce(map, reduce, { "out": "sumOfCheckouts" })
> db.sumOfCheckouts.find()

> var map = function() {
... var checkout = 0
... for (var name in this.products) {
...... var quantity = this.products[name]
...... var price = db.product.findOne({ "_id": name }).price
...... checkout += quantity * price
...... }
... emit(this.day, checkout)
... }
> var reduce = function(key, values) {
... var sum = 0
... for (var index in values) sum += values[index]
... return sum
... }

> db.tickets.mapReduce(map, reduce, { "out": "sumOfCheckouts" })

Count of unique elements
> var map = function() {
... var accumulator = {
...... "numberOfViews": 1,
...... "visitedPages": {},
...... "totalTime": 0
...... };
... accumulator["visitedPages"][this.page] = 1
... accumulator["totalTime"] += this.time
... emit(this.user, accumulator)
... }
> var aUser = db.view.findOne({ "user": "001" })
> var emit = function(id, value) { print(tojson(value)) }
> map.call(aUser)
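The transcript shows only the map step for this example. A reduce function that merges the emitted accumulators might look like the sketch below. It is an assumption about how the example could continue, not code from the slides, and the sample page views are made up. The function returns a value of the same shape as its inputs, so it remains correct if it is applied again to its own output.

```javascript
// Hypothetical reduce: merge the accumulators emitted by the map step.
function reduce(key, values) {
  const merged = { numberOfViews: 0, visitedPages: {}, totalTime: 0 };
  for (const value of values) {
    merged.numberOfViews += value.numberOfViews;
    merged.totalTime += value.totalTime;
    // Add up the per-page counters; each distinct page appears once as a key.
    for (const page in value.visitedPages) {
      merged.visitedPages[page] =
        (merged.visitedPages[page] || 0) + value.visitedPages[page];
    }
  }
  return merged;
}

// Made-up accumulators, as the map step would emit them for user "001".
const emitted = [
  { numberOfViews: 1, visitedPages: { "/home": 1 }, totalTime: 10 },
  { numberOfViews: 1, visitedPages: { "/cart": 1 }, totalTime: 5 },
  { numberOfViews: 1, visitedPages: { "/home": 1 }, totalTime: 7 }
];

const summary = reduce("001", emitted);
// The count of unique elements is the number of distinct page keys.
const uniquePages = Object.keys(summary.visitedPages).length;
console.log(summary, uniquePages);
```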

Learning Couchbase
Data manager
Any operation performed on the Couchbase database system is stored in memory, which acts as a caching layer. By default, every document is stored in memory on each read, insert, update, and so on, until the memory is full. This makes Couchbase a drop-in replacement for Memcached. To make records persistent, however, there is a disk queue, which flushes each record to disk asynchronously, without impacting the client request.
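The idea of answering from memory and persisting through a queue can be sketched as follows. WriteBehindStore is a hypothetical class, not a Couchbase API, and flush() is called by hand here, whereas a real server would drain its queue in the background.

```javascript
// Illustrative write-behind store; all names are assumptions for this sketch.
class WriteBehindStore {
  constructor() {
    this.memory = new Map(); // caching layer that serves client requests
    this.diskQueue = [];     // writes waiting to be persisted
    this.disk = new Map();   // stand-in for persistent storage
  }

  // The write is acknowledged once it is in memory and queued for disk.
  set(key, value) {
    this.memory.set(key, value);
    this.diskQueue.push({ key, value });
  }

  // Reads are served from memory first, then from disk.
  get(key) {
    return this.memory.has(key) ? this.memory.get(key) : this.disk.get(key);
  }

  // Drain the queue to disk, oldest write first.
  flush() {
    while (this.diskQueue.length > 0) {
      const { key, value } = this.diskQueue.shift();
      this.disk.set(key, value);
    }
  }
}

const store = new WriteBehindStore();
store.set("user::001", { name: "Alice" });
console.log(store.get("user::001").name);  // served from memory
console.log(store.disk.has("user::001"));  // false: not persisted yet
store.flush();
console.log(store.disk.has("user::001"));  // true: persisted after the flush
```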

The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and the data manager. The data manager manages data storage and retrieval; it contains the memory cache layer, the disk persistence mechanism, and the query engine.

Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicate with the data manager on that node to perform database operations.

A bucket is an independent virtual container that groups documents logically in a Couchbase cluster; it is equivalent to a database namespace in an RDBMS.

In Couchbase, a bucket is the equivalent of a database, but there is no concept of tables. All data or records, which are referred to as documents, are stored directly in a bucket. In other words, the lowest-level namespace for storing documents in Couchbase is a bucket.

Types of bucket
Memcached
Couchbase

Keys marked for deletion or flagged as expired are removed in one of two ways. The first is lazy deletion: the key is removed when a request arrives for that particular key. The second is an automatic maintenance process that removes expired keys at each maintenance interval; by default, it runs every 60 minutes.

Documents are distributed across the nodes in a cluster according to their document ID. Each bucket is divided into 1024 logical partitions called vBuckets. Each vBucket is bound to a particular node in the cluster, and this binding of vBuckets to server nodes is stored in the cluster map, which is a lookup structure. Each vBucket holds a subset of the document IDs. This mechanism allows effective distribution and sharding of documents across the nodes in a cluster.
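The lookup from document ID to node can be sketched as below. This is a simplified illustration: it assumes a CRC32 hash of the document ID taken modulo 1024, which may differ in detail from what the actual client libraries do, and the node names and round-robin cluster map are made up.

```javascript
const VBUCKET_COUNT = 1024;

// Standard bitwise CRC32 over the UTF-8 bytes of the text.
function crc32(text) {
  let crc = 0xffffffff;
  for (const byte of new TextEncoder().encode(text)) {
    crc ^= byte;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Assumption: the vBucket is the hash of the document ID modulo 1024.
function vbucketFor(documentId) {
  return crc32(documentId) % VBUCKET_COUNT;
}

// Hypothetical cluster map: index = vBucket number, value = owning node.
const nodes = ["node-a", "node-b", "node-c"];
const clusterMap = Array.from(
  { length: VBUCKET_COUNT },
  (unused, vbucket) => nodes[vbucket % nodes.length]
);

// A client resolves the node locally, without asking the cluster.
function locate(documentId) {
  const vbucket = vbucketFor(documentId);
  return { vbucket: vbucket, node: clusterMap[vbucket] };
}

console.log(locate("user::001"));
console.log(locate("user::002"));
```

Because the mapping is a pure function of the document ID plus a lookup table, moving a vBucket to another node only requires updating the cluster map.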

Whenever there is a failure of one of the nodes in the cluster, replica vBuckets are converted to an active vBucket in place of the vBuckets that failed because of the node failure. This process occurs instantaneously.
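The promotion of replicas can be pictured with a small sketch. The cluster map layout (one active node and a list of replica nodes per vBucket), the node names, and the failover function are assumptions for illustration, not Couchbase internals.

```javascript
// Hypothetical failover: promote a replica wherever the failed node was active.
function failover(clusterMap, failedNode) {
  return clusterMap.map((entry) => {
    // The failed node can no longer hold replicas either.
    const replicas = entry.replicas.filter((node) => node !== failedNode);
    if (entry.active !== failedNode) {
      return { active: entry.active, replicas: replicas };
    }
    // Promote the first surviving replica to active.
    const [promoted, ...rest] = replicas;
    return { active: promoted === undefined ? null : promoted, replicas: rest };
  });
}

// Four vBuckets spread over three made-up nodes, one replica each.
const before = [
  { active: "node-a", replicas: ["node-b"] },
  { active: "node-b", replicas: ["node-c"] },
  { active: "node-c", replicas: ["node-a"] },
  { active: "node-a", replicas: ["node-c"] }
];

const after = failover(before, "node-a");
console.log(after);
```

In this sketch the switch is a change to the map only; no data is copied at failover time, since the replica already holds it.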

Views enable indexing and querying by looking inside JSON documents for a key, for ranges of keys, or to aggregate data.
Views are created using incremental MapReduce, which powers indexing.

Views enable us to define materialized views on JSON documents and then query across the dataset.
A materialized view is a database object that contains the result of MapReduce.

Views are built only from documents that have been persisted to disk. As a result, a view can be missing some documents, mostly those that are still in RAM and have not yet been written to disk.

Types of views
DEVELOPMENT
PRODUCTION
Create design documents on a bucket.
Create views within that design document.
For manageability, all views are attached to a design document. So, whenever you want to write a view, you need to create a design document to attach it to. A design document can have multiple views. However, whenever the definition of any one view in a design document changes, all views that belong to that design document are rebuilt. This increases the I/O and CPU usage across the Couchbase cluster. Hence, group views into the same design document only when their definitions are not going to change often.
In a reduce function, a rereduce parameter of true indicates that the function is being called again, on the output of earlier reduce calls, as part of a re-reduce.
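A counting reduce shows why the rereduce flag matters. The sketch below emulates the view engine in plain JavaScript so it can run on its own: the emit collector, the sample tickets, the document IDs, and the way rows are split into two batches are all assumptions for illustration, and the first argument to reduce is simplified to a list of keys.

```javascript
// Stand-in for the view engine's emit(): collect the rows of the index.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function in the form a view uses: index tickets by day.
function map(doc, meta) {
  if (doc.checkout !== undefined) {
    emit(doc.day, doc.checkout);
  }
}

// Count rows. On a re-reduce the values are partial counts, not raw values.
function reduce(key, values, rereduce) {
  if (rereduce) {
    // Add up the counts returned by earlier reduce calls.
    return values.reduce((total, count) => total + count, 0);
  }
  // First pass: one emitted value per row.
  return values.length;
}

// Made-up documents.
const tickets = [
  { day: 20100123, checkout: 100 },
  { day: 20100123, checkout: 42 },
  { day: 20100123, checkout: 215 },
  { day: 20100123, checkout: 73 }
];
tickets.forEach((doc, index) => map(doc, { id: "ticket::" + index }));

// Reduce two batches separately, then combine them with rereduce = true.
const batchA = rows.slice(0, 3);
const batchB = rows.slice(3);
const countA = reduce(batchA.map((row) => row.key), batchA.map((row) => row.value), false);
const countB = reduce(batchB.map((row) => row.key), batchB.map((row) => row.value), false);
const total = reduce(null, [countA, countB], true);
console.log(countA, countB, total); // 3 1 4
```

Without the rereduce branch, the final call would return 2 (the number of partial results) instead of 4.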

Friday, August 26, 2016
