Keeping SQL Queries Fast for Large Tables
I don't have deep expertise in this topic, only some experience with a problem involving a particularly large table. SQL databases can handle a lot of data reasonably well. There are databases better suited to hundreds of millions or billions of rows, but a SQL database can usually store that much; the queries just get slow. Here are some things to consider to keep SQL queries manageable for very large tables (over a million rows). The pointers below come from a use case where I sped up an application by optimizing its database.
The Use Case
To keep things vague, I'll simply say that this particular table stored user transactions across a wide variety of data types. Queries against it were slowing down asynchronous data fetches from the frontend to the point where the network connection timed out. Even when the page did load, the query for a particular type of data ran so slowly that a user had to wait 30 seconds for an edit page to load and display its data.
Index Properly
The most important thing to get right when you expect a SQL table to accumulate millions of rows is indexing. Without proper indexes, queries over that much data will start to take a long time. In my use case, the table didn't have an index on a column that was central to a particular query. Adding one would have sped things up considerably, but there was another problem to solve before I could do that.
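You can usually confirm a missing index by asking the database how it plans to run the query. A minimal sketch, assuming a MySQL-style database and hypothetical table and column names (transactions, transaction_type):

    -- EXPLAIN shows the query plan without running the query.
    EXPLAIN
    SELECT *
      FROM transactions
     WHERE transaction_type = 'refund';
    -- In MySQL, "type: ALL" in the output means a full table scan,
    -- which is a strong hint that an index on transaction_type is missing.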
Data Types
The column that needed the index had a TEXT data type, even though it only ever held two or three distinct values. That led to the realization that almost all of the columns were TEXT. TEXT columns are not stored as efficiently as types like VARCHAR; for large values the database typically stores the data off-page and keeps a pointer to it in the row, rather than storing the value inline. On top of that, indexing a TEXT column isn't as straightforward as indexing other types; in MySQL, for example, an index on a TEXT column requires a key prefix length.
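So the fix was to shrink the column's type first and then index it. A rough sketch, again assuming a MySQL-style database and the same hypothetical names as above:

    -- Convert the TEXT column to a short VARCHAR; safe here because the
    -- column only ever holds a handful of short, known values.
    ALTER TABLE transactions
        MODIFY transaction_type VARCHAR(32) NOT NULL;

    -- With a VARCHAR column, a plain index works; indexing the original
    -- TEXT column would have required specifying a key prefix length.
    CREATE INDEX idx_transactions_type
        ON transactions (transaction_type);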
After Changing Data Types and Adding Indexes
After converting the columns to more efficient, appropriate data types and adding the necessary indexes, queries that had been timing out after several minutes took only a few seconds. I don't have the exact metrics, but the change really was as drastic as I'm making it out to be.
Further Optimizations
Even after all of that, the query still took a few seconds. Since it backed a very common screen in the frontend application, it was worth making faster. Getting further requires stepping back and looking at the problem itself. In this case, the data being pulled in was the most recent data, but the table stored every transaction ever recorded. The solution is to keep only the most relevant data in the table the frontend queries, and segregate the archived data into a separate table, away from the initial query. If a user wants to see older data, the UI puts it behind an additional action, and the resulting delay in retrieving it is understandable. Banking websites do this with "older" transactions: only the first week or month of activity is displayed, and beyond that you have to enter a "history mode" to see all past activity.
In the database design, there would be two tables: one for the most recent data, and another for archiving everything. How the data migrates from one to the other is another question. Maybe each transaction is written twice, once to each table, or maybe an automated script or archiving process on the database itself handles it.
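A sketch of the second approach, a periodic archiving job, again with hypothetical names (transactions_archive, created_at) and an arbitrary 30-day cutoff, in MySQL-style SQL:

    -- Archive table mirrors the main table's structure.
    CREATE TABLE transactions_archive LIKE transactions;

    -- Periodically move anything older than the cutoff out of the hot table.
    -- In practice you'd run this in a transaction, or in batches, so the two
    -- statements stay consistent with each other.
    INSERT INTO transactions_archive
    SELECT *
      FROM transactions
     WHERE created_at < NOW() - INTERVAL 30 DAY;

    DELETE FROM transactions
     WHERE created_at < NOW() - INTERVAL 30 DAY;

With the hot table holding only recent rows, the frontend's initial query stays small no matter how much history accumulates in the archive.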