To explain what the Nepomuk Web Extractor project is, I should start with a short introduction to Nepomuk itself. Nepomuk, a KDE project, is an RDF database and API that stores information about, and the relations between, almost everything that exists and happens on your computer. As such, it stores metadata about files, your projects (although this part is still in heavy development) and much more. Besides the basic services responsible for storing and manipulating records in the database, Nepomuk provides user-oriented services. The best known is Strigi. This service indexes all the files on your system, extracts all available information from them and stores it in Nepomuk. However, Strigi:
- Usually extracts information from only one file at a time. It will not create a relation between a video file and its subtitles if the video is on your USB flash drive and the subtitles are on a network drive.
- Extracts information from files only. Strigi doesn't check whether that information is valid from the user's point of view. It will not complain about a music file that has incompatible author and name parameters set (like Mozart, 'In the Hall of the Mountain King').
The main idea of the Nepomuk Web Extractor Service is to automatically retrieve and correct as much information as possible for all resources in the Nepomuk database, using external databases, or existing relations and data in Nepomuk itself, as sources.
Nepomuk Web Extractor
There are some issues which led to the creation of the Nepomuk Web Extractor:
- There are many tasks that could be done automatically, yet currently still require manual intervention. What about automatically linking subtitle files to the appropriate video files, no matter where they are actually stored and what their names are? Right now most players pick up subtitles only from the folder where the video file is located, and only if the naming schemes of the video file and the subtitle file match.
- File metadata may contain incorrect information, as in the aforementioned example 'Mozart, In the Hall of the Mountain King'. What about automatically correcting this metadata in Nepomuk and writing the corrected information back to the file?
- Nepomuk ontologies provide schemes for describing various resources. To mark a file as a TV show, you have to create a Nepomuk resource describing the whole TV series, then take the Nepomuk resource that describes the file and adjust it: mark it as a TV show, set its episode number, season number and, optionally, an overview. Then you link this adjusted resource to the newly created TV series resource and link the TV series resource back to it. Why should you do all this manually when there is the TheTVDB.com database and all this information can be retrieved automatically after parsing file names?
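To make the last point more concrete, here is a minimal sketch of the 'parsing file names' step. It is not the code of the actual TheTVDB plugin. The function name `guess_episode`, the `EpisodeGuess` type and the two supported naming schemes ('S01E03' and '1x03') are assumptions made purely for illustration. A real plugin would then look up the guessed series in TheTVDB.com and create the Nepomuk resources.

```python
import os
import re
from typing import NamedTuple, Optional


class EpisodeGuess(NamedTuple):
    """Hypothetical result type: what we believe the file name says."""
    series: str
    season: int
    episode: int


# Assumed naming schemes: "<series> S01E03 ..." or "<series> 1x03 ...",
# with spaces, dots, underscores or dashes as separators.
_EPISODE_RE = re.compile(
    r"^(?P<series>.+?)[ ._-]+"
    r"(?:s(?P<s1>\d{1,2})e(?P<e1>\d{1,2})|(?P<s2>\d{1,2})x(?P<e2>\d{1,2}))"
    r"(?=[ ._-]|$)",
    re.IGNORECASE,
)


def guess_episode(file_name: str) -> Optional[EpisodeGuess]:
    """Return a guess for series/season/episode, or None if unsure."""
    stem, _extension = os.path.splitext(file_name)
    match = _EPISODE_RE.match(stem)
    if match is None:
        # Better to supply nothing than to supply a wrong guess.
        return None
    season = int(match.group("s1") or match.group("s2"))
    episode = int(match.group("e1") or match.group("e2"))
    # Dots and underscores in the series part are treated as spaces.
    series = re.sub(r"[._]+", " ", match.group("series")).strip()
    return EpisodeGuess(series, season, episode)
```

A file name that matches neither scheme yields `None`, so the caller can skip the file instead of storing a guess.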
Nepomuk Web Extractor main ideas
As with any system that tries to do something automatically and to be 'smart', we have to solve some common issues, like 'What if the user provides incorrect source data?' or 'What if the information in the external source is incorrect?' How should we extract information for a music file with metadata like 'Mozart, In the Hall of the Mountain King'? Is it 'Mozart, Requiem', or 'Grieg, In the Hall of the Mountain King'? How should we choose between these variants and, more importantly, should we try to choose at all, or simply refuse to process this file until more information is supplied?
To answer these questions, the following guidelines and system design were created:
- The system assumes that the metadata about a file contains more correct information than incorrect information.
- The system assumes that information retrieved from external data sources is more likely to be correct when more external data sources report it. E.g. if OpenStreetMap and Google Maps tell us that X == A, and Bing Maps tells us that X == B, then variant X == A (two votes against one) will be chosen. It should be noted that the real system for choosing between variants is more complex than described above. The variant X == B may be chosen if, for example, Bing Maps tells us that it is 'absolutely sure that X == B', while OSM and GM only say 'probably X == A'.
- 'It is better to supply no information at all than to supply wrong information'. Supplying wrong information may lead to a situation where the metadata about a resource contains more incorrect info than correct info. That in turn leads to more incorrect information being supplied to other connected resources, and so on, until most of the information about this resource and the resources connected to it becomes incorrect.
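The second and third guidelines can be illustrated with a small sketch of confidence-weighted voting. This is not the algorithm the Web Extractor really uses (as noted above, the real one is more complex). The function name `choose_variant`, the confidence values in the range 0..1 and the `min_share` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# One report: (source name, reported value, confidence between 0 and 1).
Report = Tuple[str, str, float]


def choose_variant(reports: Iterable[Report],
                   min_share: float = 0.6) -> Optional[str]:
    """Pick the value with the highest summed confidence.

    Returns None when no value reaches `min_share` of the total
    confidence, i.e. when it is safer to supply nothing.
    """
    scores: Dict[str, float] = defaultdict(float)
    for _source, value, confidence in reports:
        scores[value] += confidence

    total = sum(scores.values())
    if total <= 0:
        return None  # no usable reports at all

    best_value, best_score = max(scores.items(), key=lambda item: item[1])
    if best_score / total < min_share:
        return None  # sources disagree too much; refuse to choose
    return best_value
```

With equal confidences this reduces to a plain majority vote. A single source that is 'absolutely sure' can outweigh two sources that are only 'probably' right. When the sources are evenly split the function refuses to choose.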
The power of the Nepomuk Web Extractor is in its plugins. I have tried to make the APIs as convenient as possible for (future) plugin creators. Generally speaking, if you don't need anything extraordinary, there is a special script that you can run to create a template for your plugin. After that you should implement a method responsible for extracting info from your source and, most likely, a method for reading the configuration. If you need something special, you should still run the script, but you will have to make more changes. More info about plugin creation is (will be) available in the wiki.
This is the list of available plugins:
- An autotag plugin which assigns a tag if a file name matches a regular expression.
- A TheTVDB plugin (written by Sebastian Trueg) which extracts info from the TheTVDB.com website.
- The Tesseract plugin for analyzing images using OCR and extracting text from them. This plugin is currently lost somewhere in the repository, but I hope I will find it again.
This is unfortunately a very short list. There are a lot of ideas for plugins, but we need more developers.
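To give a feeling for how simple a plugin's core logic can be, here is a sketch of the idea behind the autotag plugin: assign a tag when a file name matches a regular expression. The real plugin is written in C++ against the Web Extractor API. The names `compile_rules` and `tags_for`, and the rule format, are hypothetical and only illustrate the matching step.

```python
import re
from typing import Iterable, List, Tuple

# A compiled rule: (regular expression, tag to assign on match).
Rule = Tuple[re.Pattern, str]


def compile_rules(raw_rules: Iterable[Tuple[str, str]]) -> List[Rule]:
    """Compile (pattern, tag) pairs once, matching case-insensitively."""
    return [(re.compile(pattern, re.IGNORECASE), tag)
            for pattern, tag in raw_rules]


def tags_for(file_name: str, rules: Iterable[Rule]) -> List[str]:
    """Return the tags of all rules whose pattern matches the file name."""
    tags: List[str] = []
    for pattern, tag in rules:
        if pattern.search(file_name) and tag not in tags:
            tags.append(tag)
    return tags


# Example rules; the patterns and tags are made up for this sketch.
RULES = compile_rules([
    (r"\.(jpe?g|png)$", "image"),
    (r"^invoice", "finance"),
    (r"\.pdf$", "document"),
])
```

Calling `tags_for("Invoice-2011.pdf", RULES)` collects both the 'finance' and the 'document' tag, while a file name that matches no rule gets an empty list.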
Plans for the future
Although the TODO list is big, the main goals for the future are:
- More plugins. Many more plugins. That is one of the most important issues.
- A more convenient API for the creation of plugins. Currently, plugins must be implemented in C++. I think that the ability to write them in scripting languages like Python will make plugin creation easier. Plugins written in scripting languages could also be distributed via GetHotNewStuff.
- While moving to DMS, some functionality was lost. Currently it is not possible to replace one piece of information with another, so there is no way to fix errors with the Web Extractor. I hope I will be able to restore this ability in the near future.