My website provides a large amount of tagged and categorized content, and I’m trying to build a robust algorithm to match that content to user needs. Users can select the tags they are most interested in, and content that matches those tags is given higher priority. Furthermore, content can be “liked” (which pushes it up in priority), while time decay pushes it down. So, ultimately, the user should see relatively new content that is in line with their interests and is also popular with other users.
The current algorithm works like this:
- Pull all items from the content table. Assign each item a score of 1.
- Check user_tags to see if there are matches in the array pulled from step 1. If so, apply a multiplier to that item’s score.
- Check content_likes to see how many likes each item has. Apply another multiplier, based on this amount.
- Apply a third factor based on the time decay of the item. Obviously, older items receive a bigger penalty than newer items.
- Sort by total score. The resulting array should have the most relevant items first. Then, I can simply trim this array down to 20 or so items and display them on the page.
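A minimal sketch of the steps above, written in Python purely for illustration. The `Item` fields, the three constants, and the exact shape of each factor (a flat tag multiplier, a logarithmic likes multiplier, a half-life decay) are assumptions chosen to make the example concrete, not the values the site actually uses.

```python
import math
from dataclasses import dataclass

# All three constants are assumed values for illustration only.
TAG_MULTIPLIER = 1.5    # applied once if the item shares any tag with the user
LIKE_WEIGHT = 0.25      # how strongly likes boost the score
HALF_LIFE_DAYS = 7.0    # age at which the time factor drops to one half


@dataclass
class Item:
    id: int
    tags: set
    likes: int
    age_days: float


def score(item, user_tags):
    """Start every item at 1, then apply the three factors in turn."""
    s = 1.0
    if item.tags & user_tags:                        # tag match
        s *= TAG_MULTIPLIER
    s *= 1.0 + LIKE_WEIGHT * math.log1p(item.likes)  # likes, with diminishing returns
    s *= 0.5 ** (item.age_days / HALF_LIFE_DAYS)     # older items are penalized more
    return s


def rank(items, user_tags, limit=20):
    """Sort by total score, highest first, and trim to the page size."""
    ordered = sorted(items, key=lambda it: score(it, user_tags), reverse=True)
    return ordered[:limit]
```

Because every factor is multiplicative, a brand-new item with no likes and no matching tag keeps its starting score of 1, and each factor can be tuned without touching the others.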
As you can probably tell, this is a sluggish algorithm. Not only does it have to run a query to pull every single piece of content, but it then has to run separate queries to check content_likes. Too many queries!
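One direction I have been wondering about is letting the database gather everything in a single query instead of one query per item. The sketch below uses Python with SQLite only to keep it self-contained. The column names, and the `content_tags` table linking items to tags, are hypothetical, since the real schema is not shown here.

```python
import sqlite3

# Hypothetical schema: only the table names content, user_tags and
# content_likes come from the description; everything else is assumed.
SCHEMA = """
CREATE TABLE content (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
CREATE TABLE content_tags (content_id INTEGER, tag TEXT);
CREATE TABLE user_tags (user_id INTEGER, tag TEXT);
CREATE TABLE content_likes (content_id INTEGER, user_id INTEGER);
"""

# One round trip returns the like count, tag match and age of every item.
CANDIDATES_SQL = """
SELECT c.id,
       (SELECT COUNT(*)
          FROM content_likes l
         WHERE l.content_id = c.id) AS likes,
       EXISTS (SELECT 1
                 FROM content_tags ct
                 JOIN user_tags ut ON ut.tag = ct.tag
                WHERE ct.content_id = c.id
                  AND ut.user_id = :user_id) AS tag_match,
       julianday('now') - julianday(c.created_at) AS age_days
  FROM content c
 ORDER BY c.id
"""


def fetch_candidates(conn, user_id):
    """Return (id, likes, tag_match, age_days) rows for one user."""
    return conn.execute(CANDIDATES_SQL, {"user_id": user_id}).fetchall()
```

The rows could then be scored and sorted in application code exactly as before. Whether this is actually faster than the separate queries would depend on indexes and table sizes, which I have not measured.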
I suppose my first question is: am I doing this all wrong? Beyond that, can you think of any ways to optimize everything I’ve summarized above? The algorithm itself works quite well, assuming items and users have relevant tags. But I’m afraid that when my content table grows to tens of thousands of items, I’ll be in a real mess.
Thanks for your help!