I have a query that selects from only one table, with a single
WHERE filter. However, it takes a very long time to execute and even times out occasionally. This is likely because it filters about 4 million rows out of a table of 13 million rows (the other 9 million records are older than 2019), and it returns all of the columns, of which there are 101 (a mix of
varchar, datetime, and int columns). The table has two indexes: a clustered one on its primary key
interaction_id, and a nonclustered index on
interaction_date, a datetime column that is the main filter. This is the query:
SELECT * FROM [Sales].[dbo].[Interaction] WHERE year(Interaction_date) >= 2019
Is there anything obvious I can do to improve this query’s performance by adding/tweaking indexes or tweaking the query itself? Before I go down the road of an ETL process or push back on the group that needs this query (they are a Hadoop team who insist they need to sqoop all of these records, with all of the columns, all the time), I want to see if I can make it easier on people by doing something on my end as the DBA.
By default, the query plan ignores my nonclustered index on the
interaction_date column and still does a full clustered index scan. So I then tried forcing it to use the index by including
WITH (INDEX(IX_Interaction_Interaction_Date)) in the SELECT.
This forces a plan starting with a scan of the nonclustered index, with 4 million estimated rows but an estimated 13 million rows to be read. After a short time it spends the rest of the execution on the key lookup against the clustered primary key index.
But ultimately, it doesn’t seem to speed up the query at all.
Is there anything obvious I can do to improve this query’s performance by adding/tweaking indexes or tweaking the query itself?
Yes. First make the predicate sargable.
SELECT * FROM [Sales].[dbo].[Interaction] WHERE Interaction_date >= '20190101'
And then consider partitioning, or a filtered index with included columns. But even if you have an index that can support this query as a simple seek+scan, sending all the columns to the client takes time.
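A filtered index along those lines might look like the sketch below. The index name and the INCLUDE list are placeholders, not taken from your schema; in practice you would include only the columns the consumers actually read, since including all 101 effectively duplicates the post-2018 slice of the table and adds storage and write overhead:

```sql
-- Sketch only: index name and INCLUDE columns are illustrative placeholders.
CREATE NONCLUSTERED INDEX IX_Interaction_Date_Filtered
ON [Sales].[dbo].[Interaction] (Interaction_date)
INCLUDE (Col1, Col2 /* , ...the remaining columns the consumers need */)
WHERE Interaction_date >= '20190101';
```

Combined with the sargable predicate above, such an index can satisfy the query with a single range scan and no key lookups.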
Note: I’m not going to comment on the query itself; instead I’ll talk about something else that also impacts it, given the size of the result set.
Some basic math, assuming the average size of data inside varchar columns is 30 characters and columns are roughly split between the 3 types:
(30 + 8 + 4) / 3 = 14 average bytes per column; 101 columns × 14 bytes × 4,000,000 rows ≈ 5.66 billion bytes
A little over 5 and a half GB, just for the data (it would take about 86,500 UDP packets to transport at full size). And this is probably a conservative estimate considering the average bytes per column probably is significantly higher than 14 bytes. Every additional average byte per column adds another 400 MB of data transfer needed.
Depending on your connection, and assuming a direct link with no other traffic, this would take anywhere from about 45 seconds on Gigabit Ethernet to about 450 seconds, or seven and a half minutes, on 100 Mbit Ethernet, purely for this data. Any additional latency, other traffic on the line, or other interference will further increase the time needed.
I know your client is using Hadoop and Sqoop, but I’m talking about the physical limits of the connection, and there is not much that software can do about that. It is entirely possible that most of the time the query takes is simply spent getting the data from your SQL Server to your Hadoop system. You can check how big this impact is by manually copying a 5.6 GB file from the SQL Server to whatever machine Hadoop is running on. That would give a good estimate of the time your query needs just to move the data between the two machines.
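Rather than estimating average bytes per column, you can also measure the table’s real footprint from the DMVs and divide allocated size by row count to get the true average row size. A minimal sketch, assuming you have VIEW DATABASE STATE permission:

```sql
-- Actual row count and used space for the table's heap or clustered index.
-- used_page_count is in 8 KB pages, hence the * 8 to get kilobytes.
SELECT SUM(ps.row_count)           AS row_count,
       SUM(ps.used_page_count) * 8 AS used_kb
FROM sys.dm_db_partition_stats AS ps
WHERE ps.[object_id] = OBJECT_ID(N'Sales.dbo.Interaction')
  AND ps.index_id IN (0, 1);       -- 0 = heap, 1 = clustered index
```

used_kb / row_count then gives the average bytes per row actually stored, which you can plug into the transfer-time estimate above instead of the 14-bytes-per-column guess.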
The reason the index is not used is that the predicate is not selective enough: at an estimated 4M rows out of the 13M rows in the table, about 30% of all records would be read. Instead of fetching 30% of the data via random access (a key lookup per row), it’s faster to read everything and filter it in the database engine.
There are basically three options to speed up this query:
- Columnstore index (clustered, or nonclustered with all columns included). The Interaction_date column is likely filled sequentially, so rows from 2019 onward will sit only in the newest rowgroups. Ideal if your table is mostly read, because writing to a columnstore index is definitely slower than writing to a rowstore.
- Clustered index – you already write that your table has one, so it’s probably a no-go. But if you have just one date column in your table, it’s usually the best candidate for the clustered index. If you currently have, say, a clustered index on an identity column, change it to nonclustered (the performance impact is usually minimal, although test first…) and make the index on the date column clustered instead. This will get you the best performance by far, if you can do it.
- Partitioning. Generally a bad answer unless you know what you’re doing, and definitely not recommended with just 13M rows. But it can speed up this query.
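Hedged sketches of the first two options; every index and constraint name below is an assumption for illustration, not taken from your schema:

```sql
-- Option 1 sketch: a nonclustered columnstore over the columns the sqoop
-- job reads (placeholder column list). Because Interaction_date grows
-- monotonically, segment elimination skips most pre-2019 rowgroups for
-- the Interaction_date >= '20190101' predicate.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Interaction
ON [Sales].[dbo].[Interaction]
(Interaction_id, Interaction_date /* , ...the remaining columns */);

-- Option 2 sketch: move the clustered index onto the date column.
-- This rebuilds the table, so test on a restored copy first.
ALTER TABLE [Sales].[dbo].[Interaction]
    DROP CONSTRAINT PK_Interaction;        -- assumed PK constraint name
ALTER TABLE [Sales].[dbo].[Interaction]
    ADD CONSTRAINT PK_Interaction PRIMARY KEY NONCLUSTERED (Interaction_id);
CREATE CLUSTERED INDEX CIX_Interaction_Date
    ON [Sales].[dbo].[Interaction] (Interaction_date);
```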