In this Blog, we discuss the steps involved in migrating data from SQL to NoSQL database (with reference to MongoDB), and dive more into the data modeling process and why it is important. In the following Blog, we will be discussing each step of the data migration process in detail.
Introduction
The explosion of data has revolutionized the database landscape and the need to store all the data led to innovative solutions, like the cloud that has eliminated the need to worry about the storage space and management of such a huge amount of data. Another challenge emerged – the diverse nature of data. Traditional relational databases, with their rigid, fixed schemas, struggled to accommodate unstructured or semi-structured data. This was solved by NoSQL databases, like MongoDB, that offer a flexible schema, allowing data to be stored in its native form, regardless of its structure.
If your business is planning to migrate data from a relational database to a non-relational database, the first question would be – why would you want to migrate from a relational database to a non-relational database.
For that, let us answer the following questions:
1. Is my data relational?
A relational data is the one that defines relationships between various entities. For example, an order would contain customer details and product details, i.e. it has a direct relationship with customer and product data. If your data has a lot of relationships, you should rethink how you want to store the data as your data grows.
2. Are the related data in my database often accessed together?
Relational databases are known to follow normalization, a process in which relational databases split the different but related entities so that they can be stored separately. For example, instead of storing the order, customer and product details together, normalization allows to create 3 separate entities (represented as tables) for order, customer and product. Whenever the information is required together, the three entities have to be joined through a query to view the data in a single view. Normalization is done to avoid data duplication and redundancy.
However, if your entities are often accessed together and it does not make sense to normalize the data, it is more efficient to store the data together.
3. Do I expect a lot of schema changes?
Often today’s applications are quite dynamic and requirements change often to suit evolving customer demands. If your application needs to evolve as per market demands, a traditional relational database may not be a good fit, as it is difficult to make schema changes once the design stage is crossed.
4. Can I move beyond ACID?
ACID stands for Atomicity, Consistency, Isolation, and Durability. In a distributed system, ensuring strict ACID compliance can be challenging, as it requires coordination across distributed nodes. By relaxing ACID properties, NoSQL databases can continue to operate efficiently even in the presence of network failures or delays. NoSQL databases often prioritize performance, scalability, and flexibility. Depending on your priorities for the application, you may choose to migrate or not migrate your database.
There might be many other questions, however the above are some basic points to consider before you think about migration.
How data is stored in a relational database
Further to our discussion in the section above, let us see how a relational database stores data. For this, let us take an example of a blog post on a web page. A blog post will have its own attributes like blog_id, date_published, time_to_read, blog_author, blog_comments and so on. However, one author can write multiple blogs. Similarly, one blog can receive multiple comments. A traditional database would typically design the model in a way such that the entities blog_post, comments, and author are separated from each other, as each can have its own attributes. For example, comments will have description, date, username etc. Similarly the author will have details like author_name, interests, date_joined, blogs_written and so on.
This is a form of normalization, where we are splitting the main entities (as tables) to represent data in a neat and clear structure, without duplication and redundancy.
If we want to access the details of the author who wrote the blog post and the comments received, we will have to scan through each row and find the matching foreign key reference.
There might be many comments on a particular blog post and multiple authors may have co-authored it. The above strict schema does not cater to the scenario where there are multiple authors for the same blog. Also, if we need to add any media (like a video) to the blog post, that would require a schema change!
How data is stored in a NoSQL (document) database
In a NoSQL database, data that will be accessed together, can be stored together. In our case, let’s take the example of MongoDB, a document database. All the related data is embedded in the same document. For example, blog_post can have an object named author_details, which can have details of all the authors who contributed to the article. Tomorrow, if you add a media to the blog_post, you simply need to update the document to add the field - it is not necessary that all the blog posts on the page should have that field! Also, let’s say there are 50 comments on the blog post. You wouldn’t want to populate all the 50 while loading the page. However, you can filter the latest few comments and display them - MongoDB provides several design patterns to embed part of data, while keeping the rest of it separate.
Steps to migrate data from SQL to NoSQL
Migration from an SQL to NoSQL database requires a series of steps:
Modeling the data
The first step is to model the data. For this, you need to understand the existing relationships between various entities, constraints and data types, and the overall schema. In the above example, we have shown a simple relationship with 3 tables, however a production system could have hundreds to tables with complex relationships with each other. All those need to be considered to create a new model for the NoSQL database. The next step is to decide which NoSQL database you want to use - document-based, graph, key-value etc. Based on that, you can map your data. For example, if you choose to use MongoDB, you can map a table to a collection, with its columns being mapped to fields of the collection and a row, as a document.
Extract/export data
Once you have mapped the columns to their corresponding fields, you can extract and export data. The data can be extracted in any format like CSV, XML, JSON etc. Databases like MongoDB provide tools like the relational migrator to easily migrate from SQL to NoSQL.
Transform data
Transforming data involves deciding whether you want to denormalize the normalized data and to what extent you want the data to be denormalized. MongoDB provides two ways to store data - embedding and referencing. We have discussed part of it in the previous section on modeling the data. While embedding has several benefits including significant improvements in read performance, it can introduce a lot of data duplication and redundancy. In some cases, where the reads are not too often, you can choose to keep the data normalized and reference it as required.
Transforming also involves checking if the new database has all the data types available corresponding to the ones in the previous database. For example, MongoDB can map String → String, Int → Int, blob → binary or you can use GridFS to store larger files, and so on.
Load data
The next step is to load the data into the NoSQL database. You can use any tools or custom scripts. Usually, it is done as bulk insert or batch processing operation.
Verify and test data integrity, accuracy and performance
Well, once the database is migrated, you need to run all the test cases to see if the relationships, integrity and the result of any operations or transactions are intact. Also, you should verify if you are getting the desired features that made you choose the new database.
Data modeling example
Let us get back to our blog post example. In this example, we had thought of moving the author_details data into the blog_post collection itself. As the number of authors could almost always be less than 5, it is a good idea to have this information alongside the other details about the blog post. There would be data redundancy, however a blog is usually always associated with its author, so accessing the data together could significantly improve the performance.
A blog post may have any number of comments - 1, 100 or even more. When we load the page, let’s say we want to display only the first 10 comments. If we keep all the comments in a separate collection, we will have to do a lookup operation (similar to join) which might impact the page loading time. Instead, MongoDB offers a design pattern known as partial embedding, wherein you can directly store the most recent 10 comments in the blog_post document itself. For this, you can simply add an array, let’s say recent_comments. You can manage the array in such a way that it is updated once 10 new (or you can set any number) comments are added. This way, you can just do one update per 10 new documents inserted in the blog_comments collection.
This greatly improves the readability as well as performance as all the details are available in the same place, and there is no need to do a lookup (join).
Conclusion
In this article, we have discussed what are triggers that could lead you to migrate from traditional relational to a more modern NoSQL database. We also explored the steps involved in migrating from SQL to NoSQL (MongoDB) database. We also looked at how data modeling can be done with a simple example. Note that the modeling could be different for different scenarios based on what your application read/write requirements are . For example, if your blog receives say 20 comments an hour, we would be spending a lot on the updates required to the blog_post document. In that case, referencing could prove to be a better option. Check out our post on data modeling to know the considerations you need to make before choosing your data model.