CodingBlocks
Picking the Right Database Type – Tougher than You Think
You asked, we listened! A request from one of our Slack channels was to go over the various types of databases and why you might choose one over another. Join us in another information filled episode where Joe won’t be attending the event he’s been promoting and Allen tries to keep his voice together for the entirety of the episode, and almost succeeded.
News
Reviews
- iTunes: ivan.kuchin, MikeW717
- Spotify: Darren Pruitt, chutney3000
Upcoming Events
- Orlando Code Camp – Conference is February 24th
https://orlandocodecamp.com
Miscellaneous
- Kudos to Dell Support on their monitors
- The Cat 8 journey will be beginning soon
- Home offices – random desires
Database Types
Primary resource we used
Some terminology we’ll be using
- Schema on write – the schema for the data is determined before writing the record
- Schema on read – the schema of the data is understood by the client using the data
Relational DBMS
- Popular – 1. Oracle, 2. mySQL, 3. Microsoft SQL Server, 4. PostgreSQL, 8. IBM DB2, 9. Snowflake, 11. Microsoft Access
- Schema on write
- Primary language / form of access is SQL
- Schema is defined by named tables with named columns and specific data types
- Data exists as rows in the table that conform to the columns/types that are defined in the schema
- Scalability – typically vertical scaling (increasing available CPU/RAM) is the preferred way
- Horizontal scaling with most RDBMS’s is generally complex and requires a lot of thought and effort
- Can be very performant but requires knowledge on how to index and store data properly
- Even with excellent design and indexing, performance can suffer as size of data grows
- Some fun Instragram posts on scaling their databases
Key-value stores
- Popular: 6. Redis, 15. Amazon Dynamo DB, 27. Azure Cosmos DB, 35. Memcached, 54. etcd
- Schema on read
- No real language – usually an API to put and get documents
- Depending on the key value store, complex data structures may be stored and ability to query in various ways
- Scalability – horizontally scalable – massively
- Very performant
- Many have built in extended functionality beyond looking up by a single key – for instance, Redis allows search engine type of filtering
- Why’s Hadoop not on the list?
https://db-engines.com/en/blog_post/16
Document Stores
- Popular: 5. MongoDB, 15. Amazon DynamoDB, 17. Databricks, 27. Azure Cosmos DB, 34. Couchbase
- Schema on read
- DBMS specific querying – usually offer a SQL capability but often times is not the most powerful way to query the data
- Documents do not need to conform to any schema
- Multiple documents in the same collection can have completely different fields/properties, OR they have have the same properties with different data types
- Documents can contain collections in fields or even nest other documents
- Typically stores data in JSON like documents
- Can be very performant but may require care to create proper indexes, manage connections, etc
Time Series DBMS
- Popular: 28. InfluxDB, 50. Prometheus, 52. Kdb, 79. Graphite, 73. TimescaleDB
- Schema on read
- Has special features specifically tailored to time series data that isn’t quite as easy / performant in a regular RDMBS or Key/Value store
- Things like querying instants, range vectors, complex joins on ranges, etc
- Also have built in functions specific to the needs of time series data – things like rates, deltas, histograms, quantiles, etc
- Scalability seems to vary – InfluxDB is set up for scaling via clusters with meta and data nodes, whereas Prometheus has a different federated approach
- Scaling Prometheus – https://logz.io/blog/prometheus-architecture-at-scale/
- Scaling InfluxDB – https://www.influxdata.com/blog/influxdb-clustering/
- Very performant for querying time series related data
- Obviously there’s always things to consider – such as histograms vs quantiles in Prometheus – client vs server side
Graph DBMS
- Popular: 22. Neo4j, 27. Azure Comsos DB, 59. Aerospike, 75. Virtuoso, 85. ArangoDB
- Schema on write (mostly) – not sure if all graph databases force labels and attributes to be consistent
- Different in terms of functionality than other databases – graph databases store data in terms of nodes and edges
- Edges are the relationships between the nodes
- Great explanation on the Neo4j website – https://neo4j.com/docs/getting-started/data-modeling/guide-data-modeling/
- Use cases – https://neo4j.com/use-cases/
- Fraud and detection analysis
- Financial Fraud Detection with Graph Data Science
- Money Laundering Prevention with Neo4j
- Why Intelligent Applications Need a Graph Database with Granular Security
- Fraud Detection with Neo4j
- Identity and access management
- Network and IT operations
- Real time recommendations
- Fraud and detection analysis
- So why a graph database? Can’t you do this with an RDBMS and joins?
- The friend of a friend scenario – a graph database can easily and performantly return relationships with 20 degrees of separation or more – try that in a SQL query and watch your mind and database engine melt
- Neo4j has built in scalability via sharding – https://neo4j.com/product/neo4j-graph-database/scalability/
Search engine
- Popular: 7. Elasticsearch, 14. Splunk, 24. Solr, 40. OpenSearch, 58. MarkLogic
- Extensions of NoSQL databases
- Schema on read
- Complex search expressions
- Full text search
- Stemming – reducing words to their root forms so that searches can be more accurate with similar word searches
- Ranking and grouping of search results
- Built for scalability
- Incredibly performant for the use case
- Not great with relationship data
- Why choose over something like a relational or document database?
Resources
- https://db-engines.com/en/ranking
- https://db-engines.com/en/articles
- All the DB vendor websites – so much good information
-
Designing Data Intensive Applications
Tips of the Episode
- Hot tip for a multi-user document oriented distributed database that’s free, open source and you probably know how to use it already …
- Bonus points for supporting history
- The downsides…
- It’s slow at writing, and reading, and querying, and the syntax isn’t easy to learn…but other than that it’s great!
https://gitrows.com/ https://github.com/DavidBruant/github-as-a-database
- It’s slow at writing, and reading, and querying, and the syntax isn’t easy to learn…but other than that it’s great!
- kubectl cp
https://kubernetes.io/docs/reference/kubectl/generated/kubectl_cp/ - Hardware – Navepoint Rack chassis hinge
https://navepoint.com/cabinet-accessories/wall-mount-hinge-adapter/ - Bonus: ksync – a kubernetes tool for syncing files across clusters or local environments but it does require setting up an agent in the cluster
https://github.com/ksync/ksync - 14u DIY Desk
https://www.reddit.com/r/homelab/comments/rouh7m/my_14u_diy_desk_integrated_server_rack_is_finally/