A Concise Introduction To Data Modeling
Sometime in 2020, I took a comprehensive course on Data Engineering where I got to learn extensively the art of data modeling. It made so much sense to me and changed my approach to dealing with data as a software engineer — I was also working on Machine learning projects at the time.
However, my learning was not adequately documented, further exacerbated by the loss of access to the course. This resulted in a significant loss of information, leaving me feeling disheartened. To remedy this, I had to seek alternative sources online to compile high-quality resources for reacquainting myself with data modeling. This time, with thorough documentation, I aim to share, through a series of blog posts, the valuable insights I’ve gathered. I believe these insights could be immensely beneficial whenever you find yourself immersed in data-related tasks.
In this article, introductory topics such as what data modeling is, the key concepts in data modeling, the steps in data modeling, tools in data modeling, and best practices in data, modeling.
1. What is Data Modeling?
Data modeling is a process that involves defining the structure of your data and how it relates to other data within the system. It serves as a blueprint for designing databases and helps ensure that the data collected and stored is organized, efficient, and aligns with your business requirements.
The data modeling process requires that professional data modelers work in synchrony with business stakeholders, as well as potential users of the system. This process is foundational and should not be downplayed as it greatly influences the success of the system. Therefore, by creating a well-designed data model, you set the stage for efficient database operations, scalability, and ease of maintenance.
2. Key Concepts
You, at least, understand what data modeling is at this point and want to go through the essential concepts you will come across when you get to model data.
Before we delve into that, I’d love to introduce you to the three (3) kinds of data model instances;
- Conceptual schema: A conceptual schema is a high-level description of the information system that defines the semantics of an organization and represents a series of assertions about its nature. The most common forms for conceptual schemas are Entity-Relationship diagrams and UML Class diagrams. Simply described, a conceptual schema is the first step in organizing the data requirements.
- Logical schema: Logical schemas are diagrammatic representations of the Information system used to capture entities and how they relate to one another and are usually based on the structures identified in the preceding conceptual designs. The logical schema and conceptual schema are often implemented as one and the same.
- Physical schema: Physical schemas (as the name implies) represent the physical means used to store data using a particular database management system (DBMS).
Now, we can comfortably define the key concepts:
- Entities: Entities are objects or concepts that are represented in the data model. They can be tangible entities like a person or a product, or intangible entities like an order or a reservation. Example: In a social media application, entities could include User, Post, and Comment.
- Attributes: Attributes are the properties or characteristics of entities. They describe the details of an entity. Example: For the User entity, attributes could include username, email, and date of registration.
- Relationships: Relationships define how entities are connected or associated with each other. They establish connections between tables in a database. Example: A Post entity might have a relationship with a User entity, indicating that a post is created by a specific user.
- Keys: Keys are used to uniquely identify each record in an entity. Primary keys uniquely identify a record within a table, and foreign keys establish relationships between tables. Example: A User entity might have a primary key named UserID, which uniquely identifies each user.
- Normalization: Normalization is the process of organizing data to reduce redundancy and improve data integrity. It involves breaking down large tables into smaller, related tables. Example: Instead of storing address information in both User and Order entities, you might create a separate Address entity and establish relationships.
3. Steps In Data Modeling
One of the fastest ways to learn and accomplish tasks is to have an idea of what the end-to-end process looks like, so you will be able to estimate how much effort, time, and resources you need to put in. The approach to data modeling is not any different as it has different conventions that determine which symbols are used to represent the data and each process follows a sequence of tasks to be performed in an iterative manner.
To build a data model, I follow these steps:
- Identify entities: You need to identify the main object you want your system will need to store and manage. Ask the question “What are the major features of our system?”
- Define Attributes: For each entity you identified and chose to work with, you need to define the attributes or properties that describe it. Ask the question “What data do we need to collect?”
- Establish relationships: Now, you need to determine how entities are related to each other and use relationships to connect entities based on how they interact. Ask the question “What entities are related and what defines their relationship?”
- Normalize data: Analyze your data model for redundancy and eliminate it through normalization by breaking down large tables into smaller, related tables.
- Create data: Use visual tools like Entity-Relationship Diagrams (ERDs) to represent your data model graphically. This helps you understand the structure and communicate it to stakeholders.
4. Tools for Data Modeling
When you set out to create a data model, you need to create a clear visual representation of the entities and how they are related. To do this, there are some tools available for you.
- ERD Tools: To create ERDs (Entity Relationship Diagrams) Use tools like draw.io, Lucidchart, or database-specific tools to create Entity-Relationship Diagrams.
- Database Management Systems (DBMS): You can create data models using DBMS tools, such as MySQL Workbench or Microsoft SQL Server Management Studio, as they include features for designing and visualizing data models.
- Pen and Paper: You can can use the universal pen and paper method to create visualize your data model, just like me — I find it easier to model on my board during brainstorming. However, it’s better to transfer the models to electronic tools like lucidchart and draw.io.
5. Best Practices
When modeling data, it’s important to keep in mind its connection with the broader business context . As a data modeler, whether as a software engineer or data engineer, it’s very easy to lose sight of the larger perspective, so it’s important you follow these steps
- Understand Business Requirements: You have to work closely with stakeholders to understand the business requirements that the data model must fulfill.
- Keep It Simple: Strive for simplicity, thus avoiding unnecessary complexity that can make the data model challenging to understand and maintain.
- Iterate: Data modeling is an iterative process, so you need to be prepared to revise and refine your model as the project evolves.
- Document: Don’t underestimate documentation. You need to document your data model thoroughly, including entities, attributes, relationships, and any business rules.
The most important thing when data modeling is to remember how it relates to the overall business. As a software engineer it can be easy to forget the bigger picture, so make sure you are working with business analysts and any other relevant stakeholders to make sure the data model reflects reality.