Crowd- and nichesourcing for film and media scholars

[This post describes Aschwin Stacia’s MSc project and is based on his thesis]

Many online and private film collections lack the structured annotations needed to facilitate retrieval. In his Master’s project, Aschwin Stacia explored the effectiveness of a crowd- and nichesourced film tagging platform built around a subset of the Eye Open Beelden film collection.

Specifically, the project aimed at soliciting annotations appropriate for various types of media scholars, who each have their own information needs. Based on previous research and interviews, a framework categorizing these needs was developed. This framework in turn informed a data model that meets the needs for provenance and trust in user-provided metadata.

Screenshot of the FilmTagging tool, showing how users can annotate a video

A crowdsourcing and retrieval platform (FilmTagging) was developed based on this framework and data model. The frontend of the platform allows users to self-declare knowledge levels in different aspects of film and also annotate (describe) films. They can also use the provided tags and provenance information for retrieval and extract this data from the platform.

To test the effectiveness of the platform, Aschwin conducted an experiment in which 37 participants used the platform to annotate films, producing 319 annotations in total. The figure below shows the average self-reported knowledge levels.

Average self-reported knowledge levels on a 5-point scale. The topics are defined by the framework, based on previous research and interviews.

Media scholars then evaluated the annotations and the platform positively, because the platform could provide them with annotations that lead directly to film fragments useful for their research activities.

Nevertheless, capturing every scholar’s specific information needs is hard, since these needs vary heavily with the research questions the scholars have.

  • Read more details in Aschwin’s thesis [pdf].
  • Have a look at the software at , and maybe start your own Filmtagging instance
  • Test the annotation platform yourself at or watch the screencast below


MSc. Project: The search for credibility in news articles and tweets

[This post was written by Marc Jacobs and describes his MSc Thesis research]

Nowadays the world no longer relies only on traditional news sources like newspapers, television and radio. Social media, such as Twitter, are claiming a key position here, thanks to their fast publishing speed and large number of items. As one may suspect, the credibility of this unrated news becomes questionable. My Master’s thesis focuses on determining measurable credibility features (such as retweets, likes or the number of Wikipedia entities) in newsworthy tweets and online news articles.

Credibility framework pyramid

The gathering of the credibility features consisted of two parts: a theoretical part and a practical part. First, a theoretical credibility framework was built using recent studies about credibility on the Web. Next, Ubuntu was booted, Python was started, and news articles and tweets, including metadata, were mined. The news items were analysed and, based on the credibility framework, features were extracted. Further information retrieval techniques (website scraping, regular expressions, NLTK, IR APIs) were used to extract additional features, which extended the coverage of the credibility framework.
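As an illustration of the regular-expression part of this step, the sketch below derives a few countable features from a single tweet. It is not the code used in the thesis: the `extract_features` function, the dictionary layout of a tweet, and the choice of URL, hashtag and mention counts as features are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the patterns used in the thesis are not known.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_features(tweet):
    """Return simple countable features for one tweet.

    `tweet` is assumed to be a dict with a "text" key and optional
    "retweets" and "likes" counts taken from the tweet metadata.
    """
    text = tweet["text"]
    return {
        "retweets": tweet.get("retweets", 0),
        "likes": tweet.get("likes", 0),
        "url_count": len(URL_RE.findall(text)),
        "hashtag_count": len(HASHTAG_RE.findall(text)),
        "mention_count": len(MENTION_RE.findall(text)),
    }


# Made-up example tweet, for illustration only.
example = {
    "text": "Flood warning for #Amsterdam via @KNMI https://example.org/a",
    "retweets": 12,
    "likes": 3,
}
features = extract_features(example)
```

Features that need outside knowledge, such as the number of Wikipedia entities, would require a lookup against an external service and are left out of this sketch.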

The data processing and experimentation pipeline

The last step in this research was to present the features to the crowd in an experimental design, using the crowdsourcing platform CrowdFlower. For each feature, the correlation between that feature and the credibility of the tweet or news article was calculated. The results were then compared to find the differences and similarities between tweets and articles.
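The correlation step can be sketched as follows. The post does not say which correlation coefficient was used, so this example assumes Pearson's r; the function name `pearson` and the numbers are made up for illustration and are not results from the thesis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one feature value and one mean crowd rating per tweet.
retweets = [0, 5, 12, 40, 100]
crowd_credibility = [2.0, 2.5, 3.0, 3.5, 4.5]
r = pearson(retweets, crowd_credibility)
```

A value of r close to 1 or -1 would mark the feature as strongly related to perceived credibility, while a value near 0 would suggest little linear relation. For ordinal crowd ratings a rank-based coefficient may be the better fit.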

The highly correlated credibility features (which include the number of matches with Wikipedia entries) may be used in the future to construct credibility algorithms that automatically assess the credibility of newsworthy tweets or news articles and, hopefully, help to filter reliable news from the impenetrable pile of data on the Internet.

Read all the details in Marc’s thesis


MSc. Project Roy Hoeymans: Effective Recommendation in Knowledge Portals – the SKYbrary case study

[This post was written by Roy Hoeymans. It describes his MSc. project]

In this Master’s project, which I carried out externally at DNV-GL, I built a recommender system for knowledge portals. Recommender systems are pieces of software that suggest related items to a user. A knowledge portal is an online single point of access to information or knowledge on a specific subject. Examples of knowledge portals are SKYbrary and Navipedia.

Part of this project was a case study on SKYbrary, a knowledge portal on the subject of aviation safety. In this project I looked at the types of data that are typically available to knowledge portals. I used user navigation pattern data, which I retrieved via the Google Analytics API, and the text of the articles to create a user-navigation based algorithm and a content-based algorithm. The user-navigation based algorithm uses an item association formula, and the content-based algorithm uses a tf-idf weighting scheme to calculate content similarity between articles. Because each type of algorithm has its own disadvantages, I also developed a hybrid algorithm that combines the two.
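The three ingredients can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from the thesis: the post does not give the item association formula or say how the hybrid combines the two scores, so the conditional co-visit share in `association` and the weighted sum in `hybrid_score` (with `alpha = 0.5`) are assumptions, as are all function names, article identifiers and example data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each article id to a sparse {term: tf-idf weight} vector."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter(t for words in tokens.values() for t in set(words))
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        vectors[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def association(sessions, a, b):
    """Assumed formula: share of sessions visiting `a` that also visit `b`."""
    with_a = [s for s in sessions if a in s]
    if not with_a:
        return 0.0
    return sum(1 for s in with_a if b in s) / len(with_a)


def hybrid_score(content, navigation, alpha=0.5):
    """Assumed hybrid: weighted sum of the two scores."""
    return alpha * content + (1 - alpha) * navigation


# Made-up articles and navigation sessions, for illustration only.
docs = {
    "runway_incursion": "runway incursion aircraft taxi runway",
    "runway_excursion": "runway excursion aircraft landing",
    "bird_strike": "bird strike engine damage",
}
sessions = [
    {"runway_incursion", "runway_excursion"},
    {"runway_incursion", "bird_strike"},
    {"runway_incursion", "runway_excursion"},
]

vectors = tfidf_vectors(docs)
content = cosine(vectors["runway_incursion"], vectors["runway_excursion"])
navigation = association(sessions, "runway_incursion", "runway_excursion")
score = hybrid_score(content, navigation)
```

The sketch also shows why a hybrid can help: a new article with no navigation history still gets a content score, while two articles with little shared vocabulary can still be linked through visitor behaviour.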

Screenshot of the demo application

To see which type of algorithm was the most effective, I conducted a survey among the content editors of SKYbrary, who are domain experts on the subject. Each question in the survey showed an article and then recommendations for that article. The respondent was then asked to rate each recommended article on a scale from 1 (completely irrelevant) to 5 (very relevant). The results of the survey showed that the hybrid algorithm performs better than the user-navigation based algorithm, with a statistically significant difference. However, no difference was found between the hybrid algorithm and the content-based algorithm. Future work might include a more extensive or different type of evaluation.

In addition to the research on the algorithms, I also developed a demo application that the content editors of SKYbrary can use to show recommendations for a selected article and algorithm.

For more information, view Roy Hoeymans’ Thesis Presentation [pdf] or read the thesis [Academia].
