
This query selects, for each tag in a user-supplied list, the number of Stack Overflow questions, the total view count, and the number of unanswered questions. It works fine when it completes, but if I add gnuplot to the list it times out with the error message "Line 0: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding."

Is there a way to optimise it to avoid getting an error?

The line AND Posts.Tags LIKE '%<%' should not make any difference, but it seems to increase the chances that the query completes (maybe that's just an accident?).

Adding the execution plan screenshots below.

[Execution plan screenshots]

Bonus question: is there a better way to use STRING_SPLIT to create the auxiliary table?

-- INPUT EXAMPLE: pandas-datareader,google-finance-api,yahoo-finance,alpha-vantage,ta-lib,yfinance,google-finance
-- If the query times out, run it in batches, two or three tags at a time

CREATE TABLE #KeyTags (
    key_word VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CS_AS);
GO 

INSERT INTO #KeyTags (key_word) 
SELECT * FROM STRING_SPLIT(##CommaSeparatedOptions:string##, ',');
GO

SELECT #KeyTags.key_word AS key_word,
       SUM(CAST(ViewCount AS BIGINT)) AS viewed,
       COUNT(Posts.ViewCount) AS question,
       SUM(CASE WHEN Posts.AnswerCount < 1 AND Posts.ClosedDate IS NULL
                THEN 1 ELSE 0 END) AS unanswered_question
FROM Posts
JOIN #KeyTags ON Posts.Tags LIKE CONCAT('%<', #KeyTags.key_word, '>%')
WHERE Posts.PostTypeId = 1
  AND Posts.Tags LIKE '%<%'
GROUP BY key_word
ORDER BY viewed DESC;
  • the bonus question should be asked separately – jsotola, Dec 2, 2023 at 1:17

1 Answer


Scanning Posts is going to be an expensive operation no matter how many or which tags you include, complicated by the fact that every post can have as many as 5 tags. In the existing version, here is the interesting part of the execution plan:

[Existing execution plan]

This took 75 seconds for me, and performed 89 million page reads. There are multiple problematic operators I've highlighted:

  • the highlighted sort didn't have enough memory and had to spill to tempdb (which is never fast)

  • the compute scalar (the CASE expression) and the clustered index scan (the LIKE) had to read all 72 million rows and compare them one by one

  • since your #temp table isn't indexed, it actually ended up having to be scanned (albeit a very cheap scan) 72 million times and the compute scalar executed every time too...

      Posts.Tags LIKE CONCAT('%<',#KeyTags.key_word,'>%')
    

    ...since SQL Server has to manually check this match for every single row coming out of the scan, before any filtering.
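To make that last point concrete: Posts.Tags packs all of a question's tags into one delimited string, so the pattern starts with a wildcard and can never be answered by an index seek. A minimal, standalone sketch (the sample tag string is made up):

```sql
-- Posts.Tags stores every tag of a question in one string such as
-- '<python><pandas><yfinance>' (sample value, made up for illustration).
-- The leading-wildcard pattern below is non-sargable: no index on Tags
-- helps, so SQL Server must evaluate the LIKE against every row.
SELECT 1
 WHERE '<python><pandas><yfinance>' LIKE CONCAT('%<', 'pandas', '>%');
-- returns one row: the pattern matches the embedded <pandas> tag
```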

It is much better to join / EXISTS against PostTags (by pre-calculating the TagId values instead of matching them using LIKE). You are currently storing the tag names in a #temp table; if you're going to do that, you should at least match the type exactly (nvarchar(35)). But I think it's much better to store both the name and the Ids, e.g.:

CREATE TABLE #Tags
(
  TagId   int PRIMARY KEY,
  TagName nvarchar(35)
);

INSERT #Tags(TagId, TagName)
SELECT t.Id, t.TagName
  FROM Tags AS t
 WHERE EXISTS
 (
   SELECT 1
     FROM STRING_SPLIT(##CommaSeparatedTags:string##, N',') AS f
    WHERE TRIM(LOWER(f.value)) = t.TagName
 );

Now I can just join Posts via PostTags by Ids, which it can do with (lots of) seeks all around.

SELECT Tag        = t.TagName,
       Viewed     = SUM(CONVERT(bigint, p.ViewCount)), 
       Questions  = COUNT(p.Id),
       Unanswered = SUM
                    (
                      CASE WHEN p.AnswerCount < 1
                      AND p.ClosedDate IS NULL
                      THEN 1 ELSE 0 END
                    )
  FROM Posts AS p
 CROSS JOIN #Tags AS t
 WHERE p.PostTypeId = 1 
   AND EXISTS 
       (
         SELECT 1 
           FROM PostTags AS pt
          WHERE pt.PostId = p.Id
            AND pt.TagId  = t.TagId
       )
 GROUP BY t.TagName
 ORDER BY Viewed DESC;

Here is the meat of the new execution plan:

[New execution plan]

The plan now doesn't have any scans, doesn't have any operator looking at more than 4 million rows, and doesn't have to sort 4 million rows, either (my sort occurs after the aggregation, so it only sorts 3 rows and wasn't even worth capturing in the screenshot). The main issues now are:

  • a key lookup, which is just me dealing with the indexes available
  • an index seek with a warning about a missing statistic for the Id column

This ran for me in 19 seconds and performed a total of 32 million reads (just over a third of the original).


As for the bonus question: no, I can't think of anything wrong with how you're using STRING_SPLIT; you just have to account for case sensitivity and leading spaces (which is why I applied TRIM(LOWER()) to the output). I also filter out any garbage thrown in there that isn't a tag (but not garbage specifically, because that actually is an active tag on Stack Overflow).
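For illustration, here is that normalization step in isolation (a sketch; the input list is made up, and STRING_SPLIT requires SQL Server 2016+ at compatibility level 130 or higher):

```sql
-- Sketch: STRING_SPLIT returns one row per comma-separated token,
-- keeping any surrounding spaces; TRIM strips them and LOWER makes the
-- value match tag names under a case-sensitive collation.
SELECT TRIM(LOWER(f.value)) AS tag_name
  FROM STRING_SPLIT(N'pandas, YFinance ,TA-Lib', N',') AS f;
-- returns three rows: pandas, yfinance, ta-lib
```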
