A database sits at the core of most systems. Let’s look at the early days of a social media platform such as Facebook.

Presumably, it started with a single database for reading and modifying data, and everything ran smoothly. However, as the platform became hugely popular over a few years, the user base grew rapidly.

By early 2009, there were approximately 200 million active users. The application’s response time slowed significantly, leading to a poor user experience.

This degradation in performance can be attributed to the database acting as a bottleneck, resulting in high latency and low throughput. To address this issue, it becomes necessary to scale the database. Two potential approaches to accomplish this are database replication and database sharding.

In the context of a social media platform with an enormous volume of user-generated data being produced every second, database sharding emerges as a superior solution to alleviate these challenges.

Real-life analogy

Imagine you have a large library with a massive collection of books.

The books range from fiction, poetry, and novels to science, children's literature, and so forth. The library is very popular and attracts thousands of readers a day.

As a librarian, it’s a nightmare to manage and retrieve books efficiently, especially if they are not stored in a particular order.

To solve this problem, you decide to separate the library into multiple smaller libraries. Each library is an independent entity with its own collection of books, staff, and resources. Think of the small libraries as shards.

You divide the books among these smaller libraries based on some predefined rule. For instance, you might decide to shard the books based on the first letter of the author’s last name.

All books by authors whose last names start with A-E go to Library 1, F-J to Library 2, and so on. Each library only manages a subset of books, avoiding the burden of handling the entire collection.

Visitors are now directed to the appropriate library based on the book they are looking for. If a visitor wants a book written by an author whose last name starts with C, they are guided to Library 1.

Each library operates independently and manages its own collection. The libraries can handle their visitors’ requests, maintain their shelves, and perform administrative tasks without depending on other libraries. If one library experiences issues or requires maintenance, it doesn’t affect the operations of other libraries.

As the number of books and visitors keeps growing, you can continue to create more libraries and distribute the books accordingly.

In the context of databases, sharding works similarly. The database is divided into multiple shards, each hosted on a separate server or node. Data is distributed among these shards based on a predefined rule or shard key. Each shard manages its own subset of data, and queries or operations are directed to the relevant shard.

What is database sharding?

Database sharding is the process of storing a large database across multiple machines. It works by splitting the rows of a table into multiple smaller tables known as partitions. Each partition has the same schema and columns as the original table.

Each row is stored in exactly one partition, so the data in a partition is unique and independent of the other partitions. This is an example of horizontal database scaling.

Original User Table

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Horizontal Partition 1

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe

Horizontal Partition 2

UserID	FirstName	LastName	Gender	Region
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Sharding involves splitting data into two or more chunks called shards. The shards are then distributed across multiple database nodes. Together, the data held across all the shards represents the entire dataset.
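To make the tables above concrete, here is a minimal Python sketch of a horizontal split. The `User` type, the `split_horizontally` helper, and the choice of splitting on `UserID` are assumptions made for illustration; in-memory lists stand in for real database tables.

```python
from typing import NamedTuple


class User(NamedTuple):
    user_id: int
    first_name: str
    last_name: str
    gender: str
    region: str


# The original user table from the example above.
USERS = [
    User(1, "John", "Otieno", "Male", "Africa"),
    User(2, "Bob", "Henry", "Male", "Europe"),
    User(3, "Mark", "Biden", "Male", "North America"),
    User(4, "Alice", "Lee", "Female", "Asia"),
]


def split_horizontally(rows, boundary_id):
    """Split rows into two partitions by UserID; both keep every column."""
    partition_1 = [row for row in rows if row.user_id <= boundary_id]
    partition_2 = [row for row in rows if row.user_id > boundary_id]
    return partition_1, partition_2


partition_1, partition_2 = split_horizontally(USERS, boundary_id=2)

# Every row lives in exactly one partition, and together
# the partitions hold the full table.
assert sorted(partition_1 + partition_2) == sorted(USERS)
```

Both partitions hold `User` rows with the same five columns; only the set of rows differs.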

In most cases, sharding is implemented at the application level: the application contains the logic that decides which shard to read data from or write data to.

In the example above, Facebook splitting the user table by region should improve data reads and writes significantly. If a request comes from Africa, the application logic should forward it to the shard that contains African data instead of searching through a single database with billions of user records.

This improves the performance of the system significantly: latency drops and throughput rises because no single database carries the whole load. Performance can be improved further by placing each database node close to the users it serves.
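As a rough sketch of what application-level routing could look like, the hypothetical `ShardedUserStore` class below sends every read and write to the shard for the user's region. Plain dictionaries stand in for separate database nodes, and the class and method names are assumptions, not part of any real library.

```python
class ShardedUserStore:
    """Routes reads and writes to one in-memory 'shard' per region."""

    def __init__(self, regions):
        # One dict per region stands in for a separate database node.
        self.shards = {region: {} for region in regions}

    def _shard_for(self, region):
        try:
            return self.shards[region]
        except KeyError:
            raise ValueError(f"no shard configured for region {region!r}") from None

    def save(self, region, user_id, record):
        self._shard_for(region)[user_id] = record

    def get(self, region, user_id):
        return self._shard_for(region).get(user_id)


store = ShardedUserStore(["Africa", "Europe", "North America", "Asia"])
store.save("Africa", 1, {"first_name": "John", "last_name": "Otieno"})
store.save("Asia", 4, {"first_name": "Alice", "last_name": "Lee"})

john = store.get("Africa", 1)     # only the Africa shard is consulted
missing = store.get("Europe", 1)  # None: user 1 is not stored in Europe
```

The caller never scans every record; it touches only the one shard picked by the region.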

Sharding Architectures

Once you have decided to shard your database, the next question is how to go about it. When distributing the workload across different shards, it’s crucial that each request goes to the intended database. Below are a few common sharding architectures:

Range Based / Dynamic Sharding

Range-based sharding splits the rows of a table based on a range of values. For example, you can look at the user’s name and store each row in a shard according to the first letter of the name.

Name	Shard Key
Starts with A to I	A
Starts with J to S	B
Starts with T to Z	C

The application maps the shard key to a physical node and stores the row on that machine. Range-based sharding is easy to implement; however, it can overload a single physical node when the data is unevenly spread across the ranges.

In our example, shard A, which holds names that start with A to I, might contain many more rows than shard C.
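The range table above can be sketched in a few lines of Python. The `range_shard_key` function and the sample names are made up for illustration; the letter ranges mirror the table.

```python
from collections import Counter

# Inclusive upper bound of each letter range and its shard key,
# mirroring the table above (A to I, J to S, T to Z).
RANGES = [("I", "A"), ("S", "B"), ("Z", "C")]


def range_shard_key(name):
    """Return the shard key for a name based on its first letter."""
    first = name.strip()[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"cannot shard name {name!r}: it must start with A-Z")
    for upper_bound, shard_key in RANGES:
        if first <= upper_bound:
            return shard_key


# Hypothetical sample data, skewed toward the start of the alphabet.
names = ["Alice", "Bob", "Chad", "Diana", "John", "Mark", "Tom"]
rows_per_shard = Counter(range_shard_key(name) for name in names)
# rows_per_shard == Counter({"A": 4, "B": 2, "C": 1})
```

With this sample, shard A ends up with four of the seven rows, which is the kind of imbalance described above.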

Hashed sharding

Hashed sharding uses a hash function (a mathematical formula) to assign a shard key to each row of the database. The hash function takes information from the row and produces a hash value, which is used as the shard key.

Name	Shard Value
Bob	1
Alice	2
Chad	1
Diana	2

Hashed sharding works best when you want to distribute data evenly among the physical shards. However, it does not group the data according to any meaningful information.
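Here is one possible way to compute a hashed shard value in Python. The `hash_shard` helper, the choice of SHA-256, and the modulo step are assumptions made for this sketch; real systems may use other hash functions or schemes. Python's built-in `hash()` is avoided because, for strings, its result can differ between interpreter runs.

```python
import hashlib


def hash_shard(key, shard_count):
    """Map a key to a shard number between 1 and shard_count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then wrap it into range.
    return int.from_bytes(digest[:8], "big") % shard_count + 1


names = ["Bob", "Alice", "Chad", "Diana"]
assignments = {name: hash_shard(name, shard_count=2) for name in names}
```

The same name always maps to the same shard, but the shard number says nothing about the name itself. Note that with a plain modulo like this, changing `shard_count` reassigns most keys, so adding a shard means moving data.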

Directory Sharding

Directory sharding uses a lookup table to match database information to a shard. A lookup table contains a static set of information about where specific data can be found.

Company Name	Shard Key
Company A	A
Company B	B
Company C	C

Each shard is a meaningful representation of the database and is not limited by ranges. However, directory sharding fails if the lookup table contains the wrong information.
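A lookup table like the one above can be sketched as a plain dictionary. The `SHARD_DIRECTORY` and `lookup_shard` names are hypothetical, and a real directory would typically live in its own data store rather than in application memory.

```python
# The lookup table from the example above.
SHARD_DIRECTORY = {
    "Company A": "A",
    "Company B": "B",
    "Company C": "C",
}


def lookup_shard(company, directory=SHARD_DIRECTORY):
    """Return the shard key recorded for a company in the directory."""
    try:
        return directory[company]
    except KeyError:
        raise LookupError(f"{company!r} is not in the shard directory") from None


shard = lookup_shard("Company B")  # "B"

# A wrong entry is not detected by the lookup itself: requests for
# Company A are silently sent to shard B, where its data does not live.
stale_directory = {**SHARD_DIRECTORY, "Company A": "B"}
wrong_shard = lookup_shard("Company A", stale_directory)  # "B"
```

A missing entry fails loudly, while a wrong entry fails quietly, which is why the directory has to be kept accurate.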

Geo sharding

Geo sharding splits and stores information according to the user's geographical location. In the Facebook example, users’ information can be stored according to the user's region, for example Africa, Europe, Middle East, Asia, North America, and so forth.

Geo sharding can greatly improve the user experience because the shorter distance between the user and the server lets it respond to requests faster. However, it can also result in an uneven distribution of data, since some regions have far more users than others.

Benefits of database sharding

Improved Performance and Scalability - Sharding enables horizontal scaling, allowing the database to handle larger amounts of data and higher workloads. By distributing data among multiple shards, each shard can handle a subset of the overall data and workload. This parallel processing capability improves the system’s overall performance and scalability.
Enhanced Availability and Fault Tolerance - Sharding provides better fault isolation and availability compared to a single monolithic database. If one shard or database system fails, the other shards remain operational, ensuring that the system can continue to function. Sharding also allows for distributed data replication and backup strategies, further enhancing fault tolerance and data availability.
Geographical Distribution and Localized Data - Sharding facilitates the distribution of data across different geographical regions. This can be beneficial for applications with a global user base, as it enables data to be stored closer to the users, reducing latency and improving the user experience. Sharding also allows for compliance with data sovereignty regulations, as data can be stored in specific regions based on user location.

Drawbacks of Database Sharding

Increased Complexity - Sharding involves distributing and coordinating data across multiple shards, which can be challenging to set up and maintain.
Data Skew and Hotspots (celebrity problem) - Some shards may experience higher loads or hold more data, while others remain underutilized. For example, in the Facebook example, one shard might hold the data of more celebrities than the others. That shard will receive more traffic than the rest as users request the celebrities' profiles. You will have to monitor and rebalance data across the shards to keep the workload evenly distributed.
Schema Changes - Making schema changes or altering data structures across multiple shards can be more challenging than in a single-database system.
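To illustrate the data skew and hotspot item above, here is a small sketch of how hot shards could be spotted from a request log. The `find_hot_shards` function, the shard names, and the threshold of twice the average are all assumptions chosen for this example.

```python
from collections import Counter


def find_hot_shards(request_log, factor=2.0):
    """Return shards whose request count exceeds `factor` times the mean.

    The mean is taken over shards that appear in the log, so shards
    with no requests at all are not counted.
    """
    counts = Counter(request_log)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(shard for shard, n in counts.items() if n > factor * mean)


# Hypothetical log: shard-1 holds the celebrity profiles.
request_log = ["shard-1"] * 90 + ["shard-2"] * 5 + ["shard-3"] * 5
hot = find_hot_shards(request_log)  # ["shard-1"]
```

A report like this only detects the imbalance; moving data off the hot shard is a separate rebalancing step.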

Alternatives to database sharding

Sharding is a horizontal scaling strategy that allocates additional nodes to distribute the user traffic. The major advantages of scaling out are that you are not limited by the hardware of a single machine and that the architecture is fault tolerant: when one node fails, the others continue to operate, so only the data on the failed node is affected. However, sharding is only one method of database scaling. You can explore other alternatives such as:

Vertical Scaling (Scaling Up) - Vertical scaling increases the computing power of the database server, for example by adding a better CPU, more RAM, and more storage to handle increasing traffic. Vertical scaling is less costly, but there is a limit to the hardware resources you can upgrade.
Database Replication - This technique makes exact copies of the database and stores them across different nodes. Unlike replication, sharding does not create exact copies of the information; it splits a database into multiple parts and stores them on multiple computers. Sharding can be used together with replication to achieve higher scalability and availability. More details about database replication can be found here.
Database Partitioning - Partitioning is the process of splitting a database table into multiple groups. Partitioning can be classified into two types:
- Horizontal Partitioning (Sharding) - The table is split by rows.
- Vertical Partitioning - The table is split by columns.

Original User Table

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Vertical partition 1

UserID	FirstName	LastName	Gender
1	John	Otieno	Male
2	Bob	Henry	Male
3	Mark	Biden	Male
4	Alice	Lee	Female

Vertical partition 2

UserID	Region
1	Africa
2	Europe
3	North America
4	Asia

Partitioning stores all data groups on the same computer, whereas database sharding spreads them across different computers.
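The vertical split shown in the tables above can be sketched as follows. The `project` helper is a hypothetical name, and tuples in a list stand in for table rows. Both partitions keep `UserID` so the original rows can be rebuilt by joining on it.

```python
COLUMNS = ("UserID", "FirstName", "LastName", "Gender", "Region")
ROWS = [
    (1, "John", "Otieno", "Male", "Africa"),
    (2, "Bob", "Henry", "Male", "Europe"),
    (3, "Mark", "Biden", "Male", "North America"),
    (4, "Alice", "Lee", "Female", "Asia"),
]


def project(rows, columns, keep):
    """Return the rows restricted to the columns named in `keep`."""
    indexes = [columns.index(name) for name in keep]
    return [tuple(row[i] for i in indexes) for row in rows]


# Vertical partition 1 and 2: every row is present, but with fewer columns.
profile = project(ROWS, COLUMNS, ("UserID", "FirstName", "LastName", "Gender"))
location = project(ROWS, COLUMNS, ("UserID", "Region"))

# Joining the two partitions on UserID rebuilds the original table.
region_by_id = dict(location)
rejoined = [row + (region_by_id[row[0]],) for row in profile]
assert rejoined == ROWS
```

Compare this with a horizontal split, where each partition keeps all the columns but only some of the rows.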

Should I Shard My Databases?

When your application grows in size, there is a need to scale. A sharded database architecture can solve your problems, but some see it as a headache that should be avoided unless it’s absolutely necessary. Below are some of the scenarios where it may be beneficial to shard a database:

Enormous Amount of Data - When you are dealing with extremely large amounts of data, it’s advisable to split the data into multiple shards to remove performance bottlenecks.
Isolation of Different Customer Data - In multi-tenant applications such as Office 365, where data from different organizations needs to be isolated, sharding the database is an effective method. Each shard can be dedicated to a specific organization, providing data separation, privacy, and security.
Geographic Distribution - When an application or service has a global user base, sharding can be useful for distributing data closer to users in different regions. Each shard can be located in a specific geographical region, reducing data access latency and improving the user experience.

Conclusion

It’s important to note that the decision to shard a database should be based on careful analysis of the specific requirements, data patterns, and expected growth. Implementing database sharding requires thorough planning, design, and ongoing management to ensure the desired benefits are achieved effectively.

