Especially for non-equality comparisons and sorting. If it's just equality, maybe it's like a sorted hashtable? Not that I know much about how hashtables work...
But what about greater than, less than, string startsWith, string endsWith, string contains, or string case-insensitive operations? What are possible ways to optimize these (probably on write)?
Top comments (6)
In non-technical terms, database indices are structured more or less like associative arrays.
Assume you have a table called users, where each user is a row. That table would be, by default, saved in a way like this:

[ID] => row

Now let's assume you create an index (we'll ignore index types for now) on the "email" column of that table, because you do a lot of searching for users by email. Your database engine will now ALSO store this table like this:

[ID] => row,
[email] => row
When you run your query, the engine will identify possible indexes to use based on the query's content, and then use that version of the saved table to find results.
If your query has WHERE email="brandinchiu@gmail.com", then it will know to use the email index you created beforehand.
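In code, that mental model is roughly a dictionary keyed by the indexed column. A minimal sketch in Python (the table shape, column names, and sample rows are made up for illustration; real engines store this on disk in B-trees or hash structures):

```python
# Toy model of a table plus an index on the "email" column.
# A dict only captures the "associative array" intuition for
# exact-match lookups, nothing more.

rows = {
    1: {"id": 1, "email": "alice@example.com", "name": "Alice"},
    2: {"id": 2, "email": "bob@example.com", "name": "Bob"},
}

# Built on write: every INSERT/UPDATE also has to maintain this map,
# which is where the write-time cost of an index comes from.
email_index = {row["email"]: row_id for row_id, row in rows.items()}

def find_by_email(email):
    """WHERE email = ? via the index: one lookup instead of a full scan."""
    row_id = email_index.get(email)
    return rows[row_id] if row_id is not None else None

print(find_by_email("bob@example.com"))  # the row for Bob, without scanning
```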
Now this is reasonable for simple indices, but gets more complicated when you start looking at compound keys, and when you start having indices conflict with each other.
This is why properly identifying which indices matter is so important. Every time you create an index, your database gets significantly bigger, relative to the size of the table the index is on. Ideally, you want your database to run in memory, so having ten different indices on a table with a hundred million rows is going to be a headache.
Different index types help you tell the engine how to optimize the queries that use them. Integer-based indexes will always be faster, and performing arithmetic against them is going to be relatively trivial.
String indices are possible, but should be limited in scope whenever possible. Indexing something like a "description" column for purposes of searching is not something you should be doing. Instead, you would create a separate table of individual "keywords" and use the foreign key to search, not the actual paragraph text itself.
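That keyword approach is essentially an inverted index. A rough sketch of the idea in Python (the naive tokenizer and table layout are simplifying assumptions, not how any particular engine stores it):

```python
from collections import defaultdict

# Toy inverted index: keyword -> set of row ids, maintained on write.
articles = {
    1: "Databases keep indexes sorted for fast lookups",
    2: "Sorted structures make range scans cheap",
}

keyword_index = defaultdict(set)

def index_row(row_id, text):
    # Naive tokenizer; real systems also stem words and drop stopwords.
    for word in text.lower().split():
        keyword_index[word].add(row_id)

for row_id, text in articles.items():
    index_row(row_id, text)

def search(word):
    """Find rows containing a keyword without scanning full paragraphs."""
    return keyword_index.get(word.lower(), set())

print(search("sorted"))  # {1, 2}
```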
What about pre-segmentation of strings for string contains operators, probably just like how full-text search works in SQLite virtual tables? I currently don't even understand why a hashtable is near O(1), but I believe the string startsWith operator is just akin to integer greater than? Now that you mention it, how do they work with the best performance on relatively large datasets?
Also, about string indices: if I don't plan to make them unique, might it be possible to optimize by limiting the index size and looking up multiple times?

A lot of this is going to be engine-dependent, but in almost all cases I can think of, string indices should be avoided for the purpose of searching and/or sorting.
In MySQL, string operations are just integer operations under the hood, but there's a step of converting the strings to integers which degrades performance at scale.
MySQL supports the FULLTEXT index type, which indexes the entire string column. However, the larger the string is, the less performant the search will be unless you're searching for the exact, entire string (which rarely happens and defeats the purpose of a string index in the first place).
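That "strings as integers" point is easiest to see by noting that plain lexicographic comparison is just integer comparison on code points, position by position (ignoring collations, which complicate this in real engines). A quick illustration in Python:

```python
# Lexicographic string comparison is integer comparison on the
# underlying code points, one position at a time.
a, b = "apple", "apricot"
print(a < b)  # True

# The outcome is decided at the first differing position:
print([ord(c) for c in a[:3]])  # [97, 112, 112]
print([ord(c) for c in b[:3]])  # [97, 112, 114] -> 112 < 114, so a < b
```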
To my knowledge, mainstream relational databases (MySQL, PostgreSQL, SQLite) all use some form of B-tree, because a hashtable is not good for comparisons. Abstractly, you can view a B-tree as a sorted set and a hashtable as an unordered set. For string operations, I suspect they are implemented like regular expressions, which, if you restrict them to have no backreferences, can be implemented efficiently as finite state machines.
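The sorted-set view also explains why startsWith can be fast: on a sorted index, a prefix match is just a range scan, much like an integer "between" query. A minimal sketch using Python's bisect on a sorted list (standing in for a B-tree's sorted keys; the data is invented):

```python
import bisect

# A sorted list stands in for a B-tree's sorted keys.
emails = sorted([
    "alice@example.com",
    "bob@example.com",
    "bob.smith@example.com",
    "carol@example.com",
])

def starts_with(prefix):
    """Prefix search as a range scan: keys in [prefix, prefix + <high char>)."""
    lo = bisect.bisect_left(emails, prefix)
    # "\uffff" is a crude upper bound that works for ASCII-ish data.
    hi = bisect.bisect_left(emails, prefix + "\uffff")
    return emails[lo:hi]

print(starts_with("bob"))  # ['bob.smith@example.com', 'bob@example.com']
```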
It's like the index of a book: you look up the chapter name and go straight to the page where that chapter starts; you don't flip through the pages one by one until the chapter starts.
Why is it O(1), yet still supports comparisons? How do I implement it in JavaScript / Python (i.e. without using existing libraries)?
The point is, how can I have faster substring indices? How can I have prebuilt indices?
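One common answer for substring search is an n-gram index built on write (this is the idea behind PostgreSQL's pg_trgm module, for example). A rough Python sketch of a trigram index; the documents and helper names are made up for illustration:

```python
from collections import defaultdict

N = 3  # trigram length

docs = {
    1: "database indexes",
    2: "index databases fast",
}

# Built on write: map every trigram to the set of rows containing it.
trigram_index = defaultdict(set)

def index_doc(doc_id, text):
    for i in range(len(text) - N + 1):
        trigram_index[text[i:i + N]].add(doc_id)

for doc_id, text in docs.items():
    index_doc(doc_id, text)

def contains(needle):
    """Candidate rows must share all of the needle's trigrams;
    verify against the real text to drop false positives."""
    grams = [needle[i:i + N] for i in range(len(needle) - N + 1)]
    if not grams:
        return set(docs)  # needle shorter than a trigram: fall back to a scan
    candidates = set.intersection(*(trigram_index.get(g, set()) for g in grams))
    return {d for d in candidates if needle in docs[d]}

print(contains("atabas"))  # {1, 2}
```

The index is prebuilt and maintained on every write, so a contains query touches only the rows that share the needle's trigrams instead of scanning every string.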