CodingBlocks

Picking the Right Database Type – Tougher than You Think

Feb 5 '24

You asked, we listened! A request from one of our Slack channels was to go over the various types of databases and why you might choose one over another. Join us for another information-filled episode where Joe won’t be attending the event he’s been promoting and Allen tries to keep his voice together for the entire episode, and almost succeeds.

News

Reviews

iTunes: ivan.kuchin, MikeW717
Spotify: Darren Pruitt, chutney3000

Upcoming Events

Orlando Code Camp – Conference is February 24th
https://orlandocodecamp.com

Miscellaneous

Kudos to Dell Support on their monitors
The Cat 8 journey will be beginning soon
Home offices – random desires

Database Types

Primary resource we used

https://db-engines.com/en/ranking

Some terminology we’ll be using

Schema on write – the schema for the data is determined before writing the record
Schema on read – the schema of the data is understood by the client using the data
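To make the two terms concrete, here is a minimal sketch in plain Python. The `USER_SCHEMA` dictionary and both helper functions are hypothetical names made up for illustration; no real database works exactly like this, but the division of responsibility is the same.

```python
import json

# Hypothetical schema: field name -> required Python type.
USER_SCHEMA = {"id": int, "name": str}


def write_with_schema(table, record):
    """Schema on write: validate the record before it is stored."""
    if set(record) != set(USER_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in USER_SCHEMA.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw):
    """Schema on read: store anything, let the client interpret it later."""
    doc = json.loads(raw)
    return {"id": int(doc.get("id", 0)), "name": str(doc.get("name", ""))}


# Schema on write: the bad record never makes it into the table.
table = []
write_with_schema(table, {"id": 1, "name": "Allen"})
try:
    write_with_schema(table, {"id": "2", "name": "Joe"})
except TypeError as err:
    rejected = str(err)

# Schema on read: the store accepted the blob as-is; the reader decides
# which fields matter and how to coerce them.
blob_store = ['{"id": "2", "name": "Joe", "extra": true}']
interpreted = read_with_schema(blob_store[0])
```

The trade-off is where mistakes surface: at write time with a clear error, or at read time inside every client that touches the data.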

Relational DBMS

Popular – 1. Oracle, 2. MySQL, 3. Microsoft SQL Server, 4. PostgreSQL, 8. IBM DB2, 9. Snowflake, 11. Microsoft Access
Schema on write
Primary language / form of access is SQL
Schema is defined by named tables with named columns and specific data types
Data exists as rows in the table that conform to the columns/types that are defined in the schema
Scalability – typically vertical scaling (increasing available CPU/RAM) is the preferred way
- Horizontal scaling with most RDBMSs is generally complex and requires a lot of thought and effort
  - https://www.designgurus.io/blog/scaling-sql-databases
Can be very performant but requires knowledge on how to index and store data properly
- Even with excellent design and indexing, performance can suffer as size of data grows
Some fun Instagram posts on scaling their databases
- https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
- https://earthweb.com/how-many-pictures-are-on-instagram/
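As a rough illustration of named tables, typed columns, rows and indexing, here is a small sketch using Python's built-in `sqlite3` module. The `episodes` table and its columns are invented for the example, and SQLite is more lenient about column types than most of the databases listed above, so the sketch leans on a `NOT NULL` constraint to show the database rejecting a row at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front: a named table with named, typed columns.
conn.execute(
    """
    CREATE TABLE episodes (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        published TEXT NOT NULL
    )
    """
)
# An index on the column we expect to filter by.
conn.execute("CREATE INDEX idx_episodes_published ON episodes (published)")

conn.execute(
    "INSERT INTO episodes (title, published) VALUES (?, ?)",
    ("Picking the Right Database Type", "2024-02-05"),
)

# A row that violates the schema is rejected when it is written.
try:
    conn.execute(
        "INSERT INTO episodes (title, published) VALUES (?, ?)",
        (None, "2024-02-05"),
    )
except sqlite3.IntegrityError as err:
    error = str(err)

rows = conn.execute(
    "SELECT id, title FROM episodes WHERE published >= ?",
    ("2024-01-01",),
).fetchall()
```

Whether the index actually gets used depends on the query planner and the data; on a table this small it makes no practical difference.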

Key-value stores

Popular: 6. Redis, 15. Amazon DynamoDB, 27. Azure Cosmos DB, 35. Memcached, 54. etcd
Schema on read
No real query language – usually an API to put and get values
Depending on the key-value store, complex data structures may be stored and queried in various ways
Scalability – horizontally scalable – massively
Very performant
Many have built in extended functionality beyond looking up by a single key – for instance, Redis allows search engine type of filtering
Why’s Hadoop not on the list?
https://db-engines.com/en/blog_post/16
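The put/get style of access, and one reason it scales horizontally so well, can be sketched with a toy in-memory store. `TinyKV` is a hypothetical class written for this example, not the API of Redis or any other product; each "partition" stands in for a separate server.

```python
import hashlib


class TinyKV:
    """Hypothetical in-memory key-value store split across N partitions."""

    def __init__(self, partitions=4):
        self._parts = [{} for _ in range(partitions)]

    def _part(self, key):
        # Hash the key to pick a partition. Because every lookup is by key,
        # each request touches exactly one partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self._parts[int.from_bytes(digest[:8], "big") % len(self._parts)]

    def put(self, key, value):
        self._part(key)[key] = value

    def get(self, key, default=None):
        return self._part(key).get(key, default)


store = TinyKV()
# Values are opaque to the store: a nested structure or a plain number.
store.put("session:42", {"user": "allen", "ttl": 3600})
store.put("counter:plays", 1)

session = store.get("session:42")
missing = store.get("no-such-key")
```

Simple modulo hashing like this reshuffles most keys when the partition count changes, which is one reason real systems tend to use more elaborate schemes such as consistent hashing.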

Document Stores

Popular: 5. MongoDB, 15. Amazon DynamoDB, 17. Databricks, 27. Azure Cosmos DB, 34. Couchbase
Schema on read
DBMS-specific querying – many offer a SQL capability, but it is often not the most powerful way to query the data
Documents do not need to conform to any schema
- Multiple documents in the same collection can have completely different fields/properties, OR they can have the same properties with different data types
- Documents can contain collections in fields or even nest other documents
- Typically stores data in JSON like documents
Can be very performant but may require care to create proper indexes, manage connections, etc
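Here is a small sketch of what "no enforced schema" means for the code that reads the data. The collection is a plain Python list of dictionaries standing in for JSON-like documents, and `rank_of` and `top_ranked` are hypothetical helpers, not part of any document database's API.

```python
# Three documents in one "collection" with different shapes:
# rank is an int, a string, or missing entirely; one document
# carries a list field and a nested sub-document.
collection = [
    {"_id": 1, "name": "Redis", "rank": 6},
    {"_id": 2, "name": "MongoDB", "rank": "5"},
    {"_id": 3, "name": "Neo4j", "tags": ["graph"], "vendor": {"site": "neo4j.com"}},
]


def rank_of(doc):
    """Read-time interpretation: coerce rank to int, or None if unusable."""
    try:
        return int(doc["rank"])
    except (KeyError, TypeError, ValueError):
        return None


def top_ranked(docs, limit):
    """Names of documents whose rank is known and at most `limit`."""
    names = []
    for doc in docs:
        rank = rank_of(doc)
        if rank is not None and rank <= limit:
            names.append(doc["name"])
    return names


names = top_ranked(collection, 10)
```

The store accepted all three documents without complaint; the defensive handling moved into the reader instead.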

Time Series DBMS

Popular: 28. InfluxDB, 50. Prometheus, 52. Kdb, 79. Graphite, 73. TimescaleDB
Schema on read
Has special features specifically tailored to time series data that aren’t quite as easy / performant in a regular RDBMS or Key/Value store
- Things like querying instants, range vectors, complex joins on ranges, etc
  - https://prometheus.io/docs/prometheus/latest/querying/basics/
- Also have built in functions specific to the needs of time series data – things like rates, deltas, histograms, quantiles, etc
  - https://prometheus.io/docs/prometheus/latest/querying/functions/
Scalability seems to vary – InfluxDB is set up for scaling via clusters with meta and data nodes, whereas Prometheus has a different federated approach
- Scaling Prometheus – https://logz.io/blog/prometheus-architecture-at-scale/
- Scaling InfluxDB – https://www.influxdata.com/blog/influxdb-clustering/
Very performant for querying time series related data
- Obviously there are always things to consider – such as histograms vs quantiles in Prometheus – client vs server side
  - https://prometheus.io/docs/practices/histograms/
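To give a feel for range queries, rates and deltas, here is a deliberately simplified sketch over a list of `(timestamp_seconds, value)` samples. The function names are made up for the example, and this is not how Prometheus implements `rate()`; among other things it ignores counter resets and extrapolation.

```python
# Samples from a hypothetical counter, scraped every 15 seconds.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 220.0), (60, 250.0)]


def range_vector(points, start, end):
    """All samples whose timestamp falls inside [start, end]."""
    return [(t, v) for t, v in points if start <= t <= end]


def simple_delta(window):
    """Difference between the last and first value in the window."""
    if len(window) < 2:
        return None
    return window[-1][1] - window[0][1]


def simple_rate(window):
    """Average per-second increase across the window."""
    if len(window) < 2:
        return None
    elapsed = window[-1][0] - window[0][0]
    return simple_delta(window) / elapsed


last_30s = range_vector(samples, 30, 60)
delta = simple_delta(last_30s)
rate = simple_rate(last_30s)
```

A time series database provides these operations as built-in query functions and stores the data so that range scans like this stay cheap at scale.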

Graph DBMS

Popular: 22. Neo4j, 27. Azure Cosmos DB, 59. Aerospike, 75. Virtuoso, 85. ArangoDB
Schema on write (mostly) – not sure if all graph databases force labels and attributes to be consistent
- https://neo4j.com/docs/getting-started/data-modeling/guide-data-modeling/
Different in terms of functionality than other databases – graph databases store data in terms of nodes and edges
- Edges are the relationships between the nodes
Great explanation on the Neo4j website – https://neo4j.com/docs/getting-started/data-modeling/guide-data-modeling/
Use cases – https://neo4j.com/use-cases/
- Fraud detection and analytics
  - Financial Fraud Detection with Graph Data Science
  - Money Laundering Prevention with Neo4j
  - Why Intelligent Applications Need a Graph Database with Granular Security
  - Fraud Detection with Neo4j
- Identity and access management
- Network and IT operations
- Real time recommendations
So why a graph database? Can’t you do this with an RDBMS and joins?
- The friend of a friend scenario – a graph database can easily and performantly return relationships with 20 degrees of separation or more – try that in a SQL query and watch your mind and database engine melt
  - https://neo4j.com/videos/why-neo4j-3/
Neo4j has built in scalability via sharding – https://neo4j.com/product/neo4j-graph-database/scalability/
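The friend of a friend scenario is really a graph traversal. The sketch below does a breadth-first search over an in-memory adjacency map to show the shape of the problem; the names and the `within_degrees` helper are hypothetical, and a graph database would express this in its own query language rather than in application code.

```python
from collections import deque

# Nodes are people; each edge is a "friend" relationship (hypothetical data).
friends = {
    "allen": {"joe"},
    "joe": {"allen", "michael"},
    "michael": {"joe", "dana"},
    "dana": {"michael"},
}


def within_degrees(graph, start, max_degrees):
    """Map each person reachable within max_degrees hops to their distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_degrees:
            continue  # do not expand past the requested depth
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance


two_hops = within_degrees(friends, "allen", 2)
three_hops = within_degrees(friends, "allen", 3)
```

In a relational model each extra degree of separation typically means another self-join or a recursive query, which is where the pain described above comes from.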

Search engine

Popular: 7. Elasticsearch, 14. Splunk, 24. Solr, 40. OpenSearch, 58. MarkLogic
Extensions of NoSQL databases
Schema on read
Complex search expressions
Full text search
Stemming – reducing words to their root forms so that searches can be more accurate with similar word searches
Ranking and grouping of search results
Built for scalability
Incredibly performant for the use case
Not great with relationship data
Why choose over something like a relational or document database?
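Stemming, full text search and ranking can be sketched together with a tiny inverted index. Everything here is a toy: the suffix-stripping `stem` function is far cruder than the stemmers real search engines use, the scoring is a plain term-frequency sum, and the three documents are invented for the example.

```python
import re
from collections import Counter, defaultdict


def stem(word):
    """Very naive suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, and stem each one."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


docs = {
    1: "Scaling databases with sharding",
    2: "Sharded database tips",
    3: "Searching logs",
}

# Inverted index: stemmed term -> {doc_id: how often the term appears}.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term][doc_id] += 1


def search(inverted, query):
    """Rank documents by the summed frequency of the query's terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in inverted.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search(index, "scaling databases")
```

Because both the documents and the query pass through the same `tokenize` step, a search for "databases" also matches the document that only says "database".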

Resources

https://db-engines.com/en/ranking
https://db-engines.com/en/articles
All the DB vendor websites – so much good information
Designing Data-Intensive Applications

Tips of the Episode

Hot tip for a multi-user document oriented distributed database that’s free, open source and you probably know how to use it already …
- Bonus points for supporting history
- The downsides…
  - It’s slow at writing, and reading, and querying, and the syntax isn’t easy to learn…but other than that it’s great!
    - https://gitrows.com/
    - https://github.com/DavidBruant/github-as-a-database
kubectl cp
https://kubernetes.io/docs/reference/kubectl/generated/kubectl_cp/
Hardware – Navepoint Rack chassis hinge
https://navepoint.com/cabinet-accessories/wall-mount-hinge-adapter/
Bonus: ksync – a Kubernetes tool for syncing files across clusters or local environments, but it does require setting up an agent in the cluster
https://github.com/ksync/ksync
14u DIY Desk
https://www.reddit.com/r/homelab/comments/rouh7m/my_14u_diy_desk_integrated_server_rack_is_finally/