feedaro – Intelligent Newsreader
Posted on 2012/10/06
Another project from my master’s program at my university is feedaro. It came out of the course “Programming Intelligent Applications” taught by Prof. Dr.-Ing. Johannes Maucher. The aim of the course was to develop an application using techniques from artificial intelligence (AI) and/or machine learning (ML). After the lectures on the theory of AI and ML, this course was about practice. Dominik Gätjens and I chose to build an intelligent newsreader.
The aim behind feedaro was to create a personal news stream. The website shows you only the news you are actually interested in, no more and no less. That’s where the intelligence comes in: feedaro learns your interests. Simply tracking the user’s behavior and recording which news items were read is not enough. The difficult part is analyzing the news and figuring out what it is about.
So how do you spot the topic of a given text document? While there are quite a few algorithms to choose from, we decided to use LDA (Latent Dirichlet Allocation), as it seemed best suited to our use case. LDA is a topic model that works like a soft clustering algorithm (Edwin Chen wrote a good introduction to LDA). Once the LDA model has been trained, each document (news item) can be represented as a topic distribution and each topic as a word distribution. The generated topics describe the similarities between the documents.
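To make the two kinds of distributions concrete, here is a minimal sketch of LDA trained with collapsed Gibbs sampling in plain Python. It is not the code feedaro used. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative, not optimized).

    docs: list of token lists. Returns (doc_topic, topic_word, vocab), where
    doc_topic[d] is the topic distribution of document d and topic_word[k]
    is the word distribution of topic k, aligned with vocab.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    ndk = [[0] * num_topics for _ in docs]              # topic counts per document
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    nk = [0] * num_topics                               # total tokens per topic
    assignments = []

    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1

    # Smoothed estimates of the two families of distributions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word, vocab

# Toy corpus (made up): two political and two sports "articles".
docs = [
    ["election", "parliament", "vote", "minister"],
    ["vote", "election", "minister", "coalition"],
    ["goal", "match", "striker", "league"],
    ["league", "match", "goal", "coach"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for dist in doc_topic:
    print([round(p, 2) for p in dist])
```

Each printed row is one document's topic distribution and sums to 1. On real news data a tested library implementation would be the more sensible choice; the sketch only shows what the model produces.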
The biggest part of the project was getting proper data, because without it the results are useless. As a first step, we used the descriptions sent by the RSS feeds, but the results soon showed that this information was insufficient. To improve the results, we now follow the provided link and crawl the whole article text. Next, the documents need to be sanitized. Every text must be preprocessed to remove unnecessary words that carry little information about the content of the article. Take the following quote from Albert Einstein, for example:
Imagination is more important than knowledge.
would be reduced to:
imagination, important, knowledge
In the first attempt we simply removed the words found in a stoplist. We later refined this by stemming the words and keeping only the nouns detected by the part-of-speech tagger from the Natural Language Toolkit (NLTK).
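The first attempt, removing words from a stoplist, can be sketched in a few lines. The stoplist below is a tiny hypothetical one, and `remove_stopwords` is a name made up for this example. The later refinement with stemming and NLTK's part-of-speech tagger is not shown.

```python
import re

# Hypothetical, deliberately tiny stoplist; a real one has a few hundred entries.
STOPWORDS = {"a", "an", "and", "are", "is", "more", "of", "than", "the", "to"}

def remove_stopwords(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(remove_stopwords("Imagination is more important than knowledge."))
# → ['imagination', 'important', 'knowledge']
```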
The next step is to learn the interests of the users. We decided to gather this information implicitly by tracking the links a user clicks to read the whole article. Once sufficient information about the user’s interests has been gathered, only the matching news will be displayed. The more the user uses the service, the better the results will be.
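One possible way to turn clicks into a filter is sketched below. It assumes a user profile that is simply the average topic distribution of the clicked articles, with candidates ranked by cosine similarity to that profile. The post does not say how feedaro computed this, so the approach, the function names, and the numbers are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_profile(clicked_topic_dists):
    """Average the topic distributions of the articles the user clicked."""
    if not clicked_topic_dists:
        return []
    count = len(clicked_topic_dists)
    return [sum(column) / count for column in zip(*clicked_topic_dists)]

def rank_news(profile, candidates):
    """Sort article ids (id -> topic distribution) by similarity to the profile."""
    return sorted(candidates, key=lambda a: cosine(profile, candidates[a]), reverse=True)

# Made-up data: topic 0 could be "politics", topic 1 "sports".
clicked = [[0.9, 0.1], [0.7, 0.3]]
profile = build_profile(clicked)
candidates = {"sports-1": [0.1, 0.9], "politics-1": [0.85, 0.15]}
print(rank_news(profile, candidates))
# → ['politics-1', 'sports-1']
```

A real system would also need a cutoff below which articles are hidden, and some way to let interests change over time.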
Besides the intelligent news stream, two more features were added. The first is a collaborative recommender system: it recommends the news read by the most similar users (the users who read mostly the same news). The second is the calculation of the most important news of each week, which feedaro can display.
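The collaborative recommender can be illustrated with a small sketch. It assumes that "most similar" is measured by the Jaccard overlap of the sets of read articles, which is only one of several possible measures and not necessarily the one feedaro used. The function names and the sample data are made up.

```python
def jaccard(a, b):
    """Share of articles two users have in common, relative to all they read."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, reads, num_neighbors=2):
    """Recommend articles read by the users most similar to `user`.

    reads: dict mapping a user name to the set of article ids that user read.
    """
    own = reads[user]
    neighbors = sorted(
        (other for other in reads if other != user),
        key=lambda other: jaccard(own, reads[other]),
        reverse=True,
    )[:num_neighbors]
    scores = {}
    for other in neighbors:
        similarity = jaccard(own, reads[other])
        # Only articles the target user has not read yet are candidates.
        for article in reads[other] - own:
            scores[article] = scores.get(article, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda article: (-scores[article], article))

reads = {
    "alice": {"a1", "a2", "a3"},
    "bob": {"a1", "a2", "a4"},
    "carol": {"a2", "a3", "a5", "a6"},
    "dave": {"a7"},
}
print(recommend("alice", reads))
# → ['a4', 'a5', 'a6']
```

Here bob (similarity 0.5) and carol (0.4) are alice's nearest neighbors, so bob's unread article comes first and dave's is never suggested.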
If you like the idea of feedaro and wonder where you can test it, I have to disappoint you: you can’t. Although we initially planned to release the project, we quickly realized that there is a huge difference between a prototype and a production-ready version. The same applies to the theory of artificial intelligence and its practical use. Working with real-world data is hard. After making adjustments, we often just couldn’t tell whether the algorithm worked better or worse than before.
The whole point of the project was to gain practice and experience in programming intelligent applications, and that is what we got.
The following tools were used:
- MongoDB, Redis, SQLite
- Git (Bitbucket)