Table of Contents
Introduction
SELECT hello
FROM me
WHERE you're here
BECAUSE enjoy
Before the creation of the database, data organization took the form of a list in simple text files. Subsequently, databases were created and models replaced their predecessors with updates and solutions, as is the case in many corners of the tech industry. As the models progress, you'll notice how they utilize methods of file referencing in different ways. Including how the Relational Model does it best with a method called indexing.
See how the Relational Database Model changed the way data is structured and how data scientists and programmers interact with that data.
Database Theory
A database is an organized collection of structured information. They host large amounts of data on servers dedicated to the task. However, before databases came into use, data stores were unrelated. In order to relate one data file to another, programmers would have to work hard to extract data by complex parsing. There were languages that made the process a bit more efficient, but there were still issues such as duplicates in the data and conflicting versions leading to inaccurate data being supplied. To resolve this problem and create a standard way to interact with data in a database, Database Management Systems were created or DBMS. This upgrade allowed programmers to focus on interacting with the data and not data access issues or whether the data was inaccurate.
Indexing
File referencing has been made easier with the implementation of indexing in databases. It is when there is a common id or key that allows files, tables in databases and columns in a data frame to relate to one another.
Database Models
The Hierarchical Database Model
One of the early database models is known as the hierarchical model. The files were related in a tree-like way, similar to a family tree. The origin file is known as the parent and the lower level file is known as the child. The relations started at the top, A1, and extended outwards to level B. Here, it split in two, B1 and B2, which branched out to C1, C2, C3 and C4, C5, C6 respectively. This trend continues throughout the model.
The main issues concerning this model were that too many files are related to each other and there is little flexibility with adding new files. There needs to be knowledge of the entire database structure in order to interact with it efficiently, which can be hard to achieve when you're dealing with such large databases.
The Network Database Model
The network model came next, with a goal to resolve its predecessor's problem of low flexibility. They did this by allowing a member (child) to have more than one owner (parent). However, while creating the ability for more complex relationships, a programmer still needed to know the structure of the database, which is one of the main problems that still needed fixing after the creation of this model.
The Relational Database Model
The most currently common model is called the Relational Database Model. It drew the attention away from direct relationships between certain files and pointed towards relationships based on common values in fields, known as common keys. The Relational Database Model has a downside that it’s difficult to connect the information since they’re all in different tables and that’s where indexes come in. The concept of having an identifier in each table in order to link the tables is ingenious and has so many clear advantages. The complexity of the structure of the database decreased and the flexibility of creating new relationships between tables increased.
It was developed in 1970, by E.F Codd and it was first thought to be impractical since there wasn’t sufficient hardware at the time to sustain the model. As time passed, hardware upgraded immensely, to the point where now even any laptop or cellphone can run relational database management systems.
This database model goes hand in hand with SQL, a big reason for the popularity of the model, which will be discussed below.
Tools
With every database there are tools that are needed to interact with it. Below are two examples of popular choices.
Pandas
Pandas is a python library that has built in methods to speed up many processes. It has the method pd.read_sql( )
to load in data from databases and translate it into dataframes that can be manipulated with pandas.
SQL
Structured Query Language or SQL is the most popular programming language for interacting with relational databases and their DBMS. Data scientists use it to access and manipulate data in the database. It is very generous with its syntax, some say to a fault, because it can lead to inconsistencies. However, it is still very popular and has many different dialects such as PostgreSQL, MySQL, and Sqlite.
SQL Syntax
To retrieve an entire database:
SELECT *
FROM table
This will output all fields (columns) and all records (rows). But be careful this may take a while to load for a large database.
Here's the general syntax for querying:
SELECT columns
FROM table
WHERE condition
ORDER BY columns
GROUP BY column
LIMIT number
I'll walk you through it.
Query Breakdown
SELECT
- Choose the fields (columns) you'd like to consider in your result set.
* *
selects all fields
FROM
- enter the table your retrieving the data from
WHERE
- condition selection is based on
ORDER BY
- which numeric field you’d like to order the result set by
* ASC
ascending is the default order
* DESC
to sort in descending order
GROUP BY
field to group the resulting values by
LIMIT
- number of entries you’d like to limit the result set to
Conclusion
SELECT thanks
FROM conclusion
WHERE you’re here
BECAUSE i appreciate it
For more information on Database Theory, click here, this article has a lot of interesting and clear insights on the topic.
Top comments (0)