RDBMS & NoSQL Database Scaling
There are a huge amount of data is being generated every day by billions of people and they are stored in systems called databases. With a lot of data, the incoming traffic to those database systems are rapidly increasing.
Because of the huge number of traffic incoming to database systems, it is getting harder and harder to maintain the availability and performance high. There is a severe need of scaling the systems to make them highly available and high performant.
This is a brief introduction about the general database scaling concepts . It will only scratch the surface of a vast and complicated subject.
A database needs to always be available, high performant and needs to minimally latent. The big companies with huge number of data can provide that with scaling their databases. There are mainly two types of scaling concepts that can be introduce on database system. Vertical Scaling and Horizontal Scaling.
Vertical Scaling (Scale Up)
Let’s say that you bought 10 oranges and a small basket to put them in from the market. When you got back home, you realized your wife also got 10 oranges but has no basket. Would you buy another small basket to put other 10 in or buy a bigger basket to put all 20 in?
If you consider getting a bigger basket to put all 20 of them in, you went for scaling vertically.
Let’s examine more technical example. You have a small company and you have a database server with 25GB of RAM to store all the data in. After some time, your company got bigger and your data got bigger with it. Now you database gets overwhelmed with the traffic incoming.
In such a scenario, in order to scale your database system vertically, all you need to do is to get a bigger and faster database server.
- Since there is only a single node of database, it is easy to implement.
- The bigger is the server, the faster it gets. (performance-wise)
- It gets more expensive as your data largen.
- Difficult to perform parallel operations.
- Creates some risky situations when database server fails.
Let’s remember the orange — basket case from above. You have 20 oranges in total and a small basket that can store 10 oranges at a time. You choose not to get a bigger basket with the capacity to store 20 oranges. Instead, you go and purchase another small basket. You now have two small baskets each can store 10 oranges. This is called horizontal scaling.
Now, we can examine the “more technical” scenario from before with this approach in mind. You have your company and you have a database server with 25GB of RAM to store all the data in. Your needs changed as before. But this time you chose not to buy a bigger database server, you chose to increase the number of servers to compensate the incoming traffic.
- It is always the cheaper option compared to vertical scaling.
- Has a better fault tolerance. Therefore the downtime would be significantly lower.
- Easy to perform parallel tasks.
- More difficult to implement. Keeping the consistency between nodes is hard.
- Not always ideal for RDBMS since the relations between data is difficult when the data is scattered over DB nodes.
It is time to examine both scaling concepts and techniques further for both SQL and NoSQL databases.
Scaling SQL Database Systems
The Relational Databases by their very nature strictly follow the ACID properties. (You can find further information about ACID here: https://en.wikipedia.org/wiki/ACID)
In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.
The power of RDBMS’s comes from making relations between tables and always stay consistent. Hence(keeping ACID rules in mind) RDBMS’s are mostly scaled vertically. (More traffic equals bigger server!)
That does not mean that you cannot scale RDBMS’s horizontally. In order to perform horizontal scaling RDBMS’s are replicated with a method called Master-Slave Replication. Multiple Slave replicas has one master node. The true data is kept at the master node, thus writing only occurs there. The master performs and logs the writes and then passes the updated information to the slaves(replicas). Each slave node outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates. Slave nodes can perform read actions any time to reduce the load on the master node. When the master node is down due to a failure, one of the slave nodes become the master node.
While the SQL database servers can scale horizontally like above, the tables can be scaled across those nodes as well. This operation is called Partitioning.
Partitioning is the database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.
Vertical partitioning is a kind of partitioning that divides a table vertically. You divide a table logically and can access that specific part at once. It can be used in such scenarios when you don’t need to access all the data of a table.
Horizontal Partitioning (Sharding)
Horizontal partitioning is a kind of partitioning that divides tables horizontally. Horizontal partitioning divides a table into multiple tables that contain the same number of columns, but fewer rows.
For example, let’s say there is a big company data ordered by timestamps and the data is used for some analytical algorithm. If the data is commonly retrieved by its year, then it can easily be partitioned into more tables by years.
Tables are horizontally partitioned based on a column which will be used for partitioning and the ranges associated to each partition. Partitioning column is usually a datetime column but all data types that are valid for use as index columns can be used as a partitioning column, except a timestamp column. The ntext, text, image, xml, varchar(max), nvarchar(max), or varbinary(max), Microsoft .NET Framework common language runtime (CLR) user-defined type, and alias data type columns cannot be specified.
Scaling NoSQL Database Systems
We talked about vertical and horizontal scaling in general and how to apply them onto SQL database systems so far. The next thing is how they are applied onto NoSQL database systems and which one is better on which scenarios.
Vertical scaling is the same. If you want to scale up, you need to get bigger servers. There is no magic here. The real challenge is to scale database systems horizontally. Which is almost always a better option when there are big data in hand.
Although horizontal scaling is almost always better than vertical scaling in most of the cases, we know that it is harder and riskier to apply on to database system because of the CAP theorem. You can find information about CAP theorem here: https://en.wikipedia.org/wiki/CAP_theorem
In theoretical computer science, the CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
We are in luck though. Because NoSQL systems in their very nature are designed to scale horizontally. They are designed to address the scalability requirements that an ordinary SQL cannot address. Since they follow the de-normalization concept, there is no necessity for a single point of truth. It can store duplicates.
Due to its architectural design, NoSQL database systems relies upon a softer model known as the BASE model instead of ACID model mentioned above in SQL part. You can find more information about BASE model here.
Basically Available: Guarantees the availability of the data . There will be a response to any request (can be failure too).
Soft state: The state of the system could change over time.
Eventual consistency: The system will eventually become consistent once it stops receiving input.
NoSQL databases give up the A, C and/or D requirements, and in return they improve scalability.
Like RDBMS, replication is also used to scale NoSQL Database Systems horizontally. Since there is no single point of truth, in order to use master — slave architecture like in SQL, NoSQL systems use variety of different replication methods.
I will not explain each NoSQL database servers and their replication methods because It would make the article too long (since the main purpose of this article to introduce basic concepts of database scaling as plain as possible) instead I will list some of the commonly used NoSQL databases and their replication methods below.
Maybe the most common NoSQL database is MongoDB. You can find how MongoDB replicates its nodes to scale the database horizontally here: https://docs.mongodb.com/manual/replication/
A replica set in MongoDB is a group of
mongodprocesses that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. This section introduces replication in MongoDB as well as the components and architecture of replica sets. The section also provides tutorials for common tasks related to replica sets.
Another widely used NoSQL database is Apache’s famous Cassandra. You can find its replication method here.
Couchbase is a both a document-typed and key-value typed NoSQL database and it has some different replication methods that you can find interesting.
This article is already longer that I intended it to be so I will not go in more detail about other specific databases. I hope this has been very helpful. Stay healthy.