Querying and Ranking
Now that I have the database structure of the stock recommendation system set up I will start with the first step of the backend. As mentioned earlier I’m using FastApi for the backend. I’m using a modular approach in which I define the individual routes in a routers
folder and then include those inside the main.py
file in the root directory. In the stock_query.py
route the app will accept a standard query q
and set a limit of 10 results to be returned. It’s going to use ILIKE
to search for entries which include the query. This operation is made significantly faster by the Trigram Index. It then returns the results as a list of Objects.
Now that the app is successfully querying the database for tickers/titles and returning them to the user I can focus on optimizing the order of the results. Right now the results are returned in the order in which they are found. This isn’t a terrible approach and often works relatively well since the SEC file I got the data from was sorted by market cap in descending size. This leads to the largest companies being shown first and therefore the likelihood of the person looking for that stock is relatively high. However I want better and more sophisticated sorting, since I want to be able to show relevant results even for smaller stocks.
One way to do this is to check the similarity between the query and the ticker/title. This can be done by using PostgreSQL’s similarity
function to compute a similarity score. The results are then ordered by the “relevance score” which I assigned in descending order.
So I put this approach to the test and compared it to the default approach. For that I used the stock ticker “NU” with the stock title “Nu Holdings Ltd.” and input “nu” as a query. Using the default approach the first hit was “TSM” or “Taiwan Semiconductor Manufacturing Co Ltd”, this was because of the “nu” in “manufacturing” and because it had a larger market cap than NU, perfectly demonstrating my point of why I need to optimize. However NU was the second hit in the list. With the relevance score in place NU moved down to the third place, however this time the two results that were higher up were more relevant: “NRDE” (NU RIDE INC.) and “NUMD” (Nu-Med Plus, Inc.). The problem here was obvious, the algorithm clearly succeeded at sorting by relevance, only it focused on the title when the ticker is really (in most cases) more important.
Therefore for the next approach I’d have to prioritize the tickers. For that I can use the CASE
statement to assign a higher priority to matches in the ticker field. I assign a full ticker match the highest priority of 1, a partial ticker match the priority of 2 and the rest (matches in title) a priority of 3. Now NU finally moved up to the top spot. However the results below were quite disappointing this time:
After not being able to find any more priority 1 or 2 results the algorithm returned only priority 3 results in alphabetical order. This means it just returned any stock that had “nu” in it’s title in alphabetical order leading to the next result being “PPLT (abrdn Platinum ETF Trust).
After the above testing it becomes clear that the relevance and the case approach both had it’s strengths and weaknesses so I’m going to try and combine them. I want to keep giving ticker matches higher priority over title matches however I also want title matches to be sorted by relevance. Therefore a good approach seems to be to use the relevance/similarity score system but add additional weighting to ticker matches. I’ll combine PostgreSQL’s CASE
and SIMILARITY
functions to
- Assign a higher weight for exact matches.
- Assign lower weights for partial matches and title matches.
I’ll do this using the following approach:
- Exact Matches
- Assign a fixed high score (e.g.,
1.0
) to exact matches on theticker
.
- Assign a fixed high score (e.g.,
- Partial Matches
- Use
SIMILARITY()
to calculate how close the query is to theticker
andtitle
. - Weight the
ticker
similarity higher (e.g.,0.8
) than thetitle
similarity (e.g.,0.2
).
- Use
- Relevance Score:
- Combine these weighted scores into a single relevance value, sorted in descending order.
And indeed I now get “NU” as the top result followed by “NUE” (NUCOR CORP) and “NUS” (NU SKIN ENTERPRISES, INC.). After a bit of trial and error I found 0.7 weight for the ticker similarity and 0.3 for the title similarity to strike the best results. I could of course do this analysis in a much more sophisticated way by manually creating a dataset with ideal matches for certain queries and then automating evaluation by testing a range of weight combinations based on factors like Precision or Recall (How many of the ideal results appear in the top N results?) and then plotting the performance metrics to facilitate a data driven visualization of the results. However this seems a bit overkill for now so my personal approximation will have to do (Besides this begs the question of how much difference the user is actually even going to notice).