What is Data Federation?
In a real-world application, data is often spread across different systems. For example, to build a complete picture of a customer (say, for recommendations, offers, reporting, or analysis), an organization may need their credit card history, personal preferences, employment details, and other public information, each held in a different data store. Financial data could be in one database, personal and social media details in another, and employment details in a third.
Data federation lets you create a virtual database that calls the necessary interfaces to connect to the required data stores (sources) and retrieves the information needed from each of them.
The important point is that the user fires a single query. It is up to the federated database to understand the query, split it, route each part to the right data source, collate the responses, and return a single unified view to the client.
The data remains in its original store, and the integration happens in real-time when a query is fired.
Data in each data source can be in a different format. For example, one source could be a relational database, another a document database, another a graph database, and so on. The key is to build the necessary interfaces to get the data.
As a result, your application does not need to consolidate data and store it in another data store, and you can often retrieve the data sooner, because it does not have to be extracted, transformed, and loaded into a separate store first.
Federated Query Engine
The query engine translates a single query sent by the client into multiple subqueries and sends each one to the relevant data store. The engine then merges the results from the different sources and returns them to the client as a single result.
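The split, dispatch, and merge steps described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the in-memory `sources` object stands in for remote data stores, and `planSubqueries`, `runSubquery`, and `federatedQuery` are hypothetical names, not part of any MongoDB API.

```javascript
// Hypothetical in-memory "data stores"; a real engine would talk to remote systems.
const sources = {
  customers: [
    { _id: 1, name: 'Asha' },
    { _id: 2, name: 'Ben' },
  ],
  orders: [
    { _id: 101, customerId: 1, total: 40 },
    { _id: 102, customerId: 1, total: 15 },
    { _id: 103, customerId: 2, total: 99 },
  ],
};

// Split one client query into one subquery per data store.
function planSubqueries(query) {
  return [
    { store: 'customers', filter: (doc) => doc._id === query.customerId },
    { store: 'orders', filter: (doc) => doc.customerId === query.customerId },
  ];
}

// Run a subquery against the store it targets.
function runSubquery(subquery) {
  return sources[subquery.store].filter(subquery.filter);
}

// Merge the partial results into a single response for the client.
function federatedQuery(query) {
  const [customerRows, orderRows] = planSubqueries(query).map(runSubquery);
  if (customerRows.length === 0) return null;
  return { ...customerRows[0], orders: orderRows };
}

console.log(federatedQuery({ customerId: 1 }));
```

The client only ever calls `federatedQuery`; which stores exist and how the results are stitched together stays hidden behind that one entry point.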
Federated Queries
Federated queries are the queries that you (an application, service, or client) write to get data from multiple data sources. For example, in MongoDB, you can use a federated query to get data from different MongoDB Atlas clusters, Atlas Data Lake, AWS S3 buckets, Azure Blob Storage, Atlas Online Archive, and HTTP and HTTPS endpoints.
To make this concrete, consider a simple example where customer information, the product catalog, and customer orders live in three separate data stores. Using a federated query, you can fire a single query to get information from all three data stores, as if they were all in the same place. Below is an example of how this can be done with $lookup stages in an aggregation pipeline, written in the MongoDB Query Language (MQL). The same idea can be expressed in SQL for relational databases.
// Federated query to get customer's order details using the customerId
const orderDetails = customerCollection.aggregate([
  {
    // Join each customer with their orders from the order data store
    $lookup: {
      from: '<order-datastore>.<order-database>.<order-collection>',
      localField: '_id',
      foreignField: 'customerId',
      as: 'orders'
    }
  },
  {
    // Emit one document per order
    $unwind: '$orders'
  },
  {
    // Join each order with its product from the product data store
    $lookup: {
      from: '<product-datastore>.<product-database>.<product-collection>',
      localField: 'orders.productId',
      foreignField: '_id',
      as: 'products'
    }
  }
]).toArray();
In the example above, the federated query starts from the customer collection, pulls in matching documents from the order and product data stores, and combines them in the aggregation pipeline, using customerId to link customers to their orders.
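To see the shape of the documents such a pipeline produces, here is a small in-memory emulation of the two stages it uses. It is a simplified sketch, not MongoDB's implementation: the helpers `getPath`, `lookup`, and `unwind` and the sample data are hypothetical, and `lookup` only handles scalar equality.

```javascript
// Read a possibly dotted field path such as 'orders.productId'.
function getPath(doc, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Simplified $lookup: attach every foreign document whose foreignField
// equals this document's localField (scalar equality only).
function lookup(docs, foreignDocs, localField, foreignField, as) {
  return docs.map((doc) => ({
    ...doc,
    [as]: foreignDocs.filter(
      (foreign) => getPath(foreign, foreignField) === getPath(doc, localField)
    ),
  }));
}

// Simplified $unwind: emit one document per element of the array field.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((item) => ({ ...doc, [field]: item }))
  );
}

// Hypothetical sample data standing in for the three data stores.
const sampleCustomers = [{ _id: 1, name: 'Asha' }];
const sampleOrders = [
  { _id: 101, customerId: 1, productId: 'p1' },
  { _id: 102, customerId: 1, productId: 'p2' },
];
const sampleProducts = [
  { _id: 'p1', title: 'Lamp' },
  { _id: 'p2', title: 'Desk' },
];

const withOrders = lookup(sampleCustomers, sampleOrders, '_id', 'customerId', 'orders');
const perOrder = unwind(withOrders, 'orders');
const joined = lookup(perOrder, sampleProducts, 'orders.productId', '_id', 'products');

console.log(JSON.stringify(joined, null, 2));
```

Each output document carries the customer fields, a single order under `orders`, and the matching product in the `products` array.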
MongoDB Atlas Data Federation
MongoDB Atlas offers more than just getting data from different sources with federated queries. You can also control access for analytics users by granting them access only to a federated database that exposes specific details, rather than to the actual underlying database. For example, a business analyst gathering data for reporting does not need the entire data store, only specific fields. A federated database can expose only the required fields and leave out the rest, giving a comprehensive yet controlled view.
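The idea of a controlled view can be illustrated with a short sketch that exposes only an allow-list of fields. The function `restrictedView`, the field names, and the sample record are hypothetical and only show the principle; they do not describe how Atlas enforces access.

```javascript
// Build a restricted view that exposes only the allowed fields of each document.
function restrictedView(docs, allowedFields) {
  return docs.map((doc) =>
    Object.fromEntries(
      allowedFields
        .filter((field) => field in doc)
        .map((field) => [field, doc[field]])
    )
  );
}

// Hypothetical underlying record with sensitive fields.
const customerRecords = [
  { _id: 1, name: 'Asha', region: 'APAC', cardNumber: '4111-xxxx', salary: 90000 },
];

// The analyst sees only the fields needed for reporting.
const analystView = restrictedView(customerRecords, ['_id', 'region']);
console.log(analystView);
```

The underlying records are never modified; the analyst simply receives a narrower projection of them.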
MongoDB Atlas Data Federation is easy to set up. If you already have a cluster on Atlas (a free one is enough), log in to your Atlas account and click the Data Federation tab to create your first federated database instance.
You can then connect to the federated database instance from your application through the drivers, MongoDB Compass, the MongoDB Shell, or even SQL tools to analyze and visualize data.
To configure and add data sources, click the Configuration button next to the Connect button. This opens the add data source screen.
You can view the Visual or the JSON version of the instance configuration in the Atlas UI. If the data is in another format, such as a .csv file, Atlas automatically converts it into document form when you add the source: the columns of the file map to field names (keys), and each row becomes a single document.
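The column-to-field mapping described above can be sketched as follows. This is a deliberately naive illustration, not Atlas's converter: `csvToDocuments` is a hypothetical helper that assumes a header row, splits on plain commas (no quoted fields), and keeps every value as a string.

```javascript
// Convert CSV text into an array of documents:
// header columns become field names, each data row becomes one document.
function csvToDocuments(csvText) {
  const [headerLine, ...rowLines] = csvText.trim().split(/\r?\n/);
  const fieldNames = headerLine.split(',').map((name) => name.trim());
  return rowLines
    .filter((line) => line.trim() !== '')
    .map((line) => {
      const values = line.split(',').map((value) => value.trim());
      return Object.fromEntries(fieldNames.map((name, i) => [name, values[i]]));
    });
}

const csvSample = 'name,city\nAsha,Pune\nBen,Oslo';
console.log(csvToDocuments(csvSample));
```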
To add sources, you can choose from multiple options.
Every federated database instance can have virtual databases (from various data sources) and collections that are mapped to the data in your data stores. The Atlas Data Federation architecture consists mainly of three planes: the control plane, where the results are aggregated; the compute plane, where all the incoming requests are processed; and the data plane, where the data sits.
You can learn more about the Atlas Data Federation architecture from the official documentation page.
Conclusion
As we have learnt, data federation provides several benefits, including faster retrieval of data from multiple sources, access control, and a unified view of data from different data sources. Unlike data warehousing, data federation avoids data duplication: you query data at its original source rather than copying it into another store, which reduces storage costs. Data federation also enables horizontal data scaling and parallel processing of queries. MongoDB Atlas Data Federation further allows you to add data sources, leverage the power of aggregations, automatically copy data from and into Atlas clusters and AWS S3 buckets, and do much more through the Atlas UI, making the process faster and more efficient.