

The Wikipedia Analysis Project

Wikipedia analysis is great fun. In fact, even though many of my findings are exciting and some are quite surprising, they are typically unpublishable in the scientific literature. I'll try to put some of them here.

Why Wikipedia?

Well, for many reasons. Most importantly, because it's a perfectly documented collaborative effort of a big community driven by many, sometimes opposing, forces. I perceive Wikipedia as a thermometer of society - an ever-changing repository of knowledge reflecting the current state of mind of the people who care about it.

Is it representative? Certainly not. But I believe it does serve as the litmus paper of society, since it's created by some of the most active people, who contribute to every aspect of the cultural, political and scientific life of their surroundings.

I believe that Wikipedia research is quite valuable for a large number of disciplines. I come from physics, so for me Wiki is one of the best-documented examples of a complex system approaching the thermodynamic limit. Note that the entire evolution, every tiny change of the Wiki, is available for analysis. Processing it is a great challenge (the English Wiki exceeds 1TB), but that's just one of the things I enjoy doing.

Obviously, Wiki is a great example of taxonomy. It traces the evolution of a social system: a community collaboratively working on the same project without a well-defined goal. Human behavioral patterns are embedded in Wiki's history.

Wiki is linked in real time to the processes occurring in society (at some point I might put some material demonstrating this online). It's not politically correct. It's for real. It just exposes how people think and not how they think they think (which is what questionnaires usually discover).

What do I study?

Wikipedia is a practically unlimited source of information. One can ask billions of questions and be surprised by the answers. Basically, I'm interested in networks, which Wikipedia contains in great numbers. Articles are linked through references. There are links between articles and categories, and between categories themselves. There is the affiliation network of people and the articles they edit. There are references between Wiki pages in different languages. I deeply believe that these networks, and especially their microscopic dynamics, contain valuable information about us and our society.
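To make the list of networks above concrete, here is a minimal sketch of one way to hold them side by side: a single list of typed edges, split into one adjacency map per network. The page names, the edge-kind labels and the helper functions are all made up for illustration; this is not the data format or the code I actually use.

```python
from collections import defaultdict

# Hypothetical toy data: (source, target, kind) triples.
# The kinds mirror the networks listed in the text; every name is invented.
EDGES = [
    ("Article:Physics", "Article:Energy", "article_link"),
    ("Article:Energy", "Article:Physics", "article_link"),
    ("Article:Physics", "Category:Science", "article_category"),
    ("Category:Physics", "Category:Science", "category_category"),
    ("User:Alice", "Article:Physics", "editor_article"),
    ("User:Alice", "Article:Energy", "editor_article"),
    ("en:Physics", "de:Physik", "interlanguage"),
]


def split_by_kind(edges):
    """Group edges into one adjacency map {source: {targets}} per kind."""
    networks = defaultdict(lambda: defaultdict(set))
    for source, target, kind in edges:
        networks[kind][source].add(target)
    return networks


def in_degree(adjacency):
    """Count how many distinct sources point at each target."""
    counts = defaultdict(int)
    for targets in adjacency.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


networks = split_by_kind(EDGES)
# Number of distinct editors per article in the toy affiliation network.
editors_per_article = in_degree(networks["editor_article"])
```

Keeping the edge kind explicit makes it easy to study each network separately or to combine them later, which is all this sketch is meant to show.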

Of course, Wiki can be studied on other levels. But I try to avoid the "dirt" of content analysis and tend to persuade myself that much can be done without it. So, let's focus on networks.


Unfortunately, I'm unable to publish the majority of my findings. They are too abstract for my field. Want an example? At some point I tried a novel ranking algorithm which produced the most "important" terms in each language. Most languages showed similar patterns, but the exceptions were exciting. For example, each language listed its own country first. Next usually came the countries most influential for the corresponding culture: the USA, France, Germany, Italy and GB for Russian; Spanish-speaking countries for Spanish; France for Italian; the USA for everybody. The funny exception in this case was Israel, where the score for the USA exceeded the score for Israel itself (Israelis can surely understand the irony).

Further down, the important cities of the country where the language is spoken are usually listed. For Russia, these were Moscow and St. Petersburg, and that's it. The next city is very, very far away.

France was even funnier. Only one highly ranked country is listed besides France: Belgium. Then one gets every possible France-related term - areas of France, ministries, etc. France, France, France. As if nothing else existed.

Italy had movie-industry-related terms very high in its hierarchy. Spain had Formula 1-related terms.

To sum things up: the results are fun but hardly quantifiable, hence unpublishable in the physics literature. Of course, I was able to compile a nice paper reconstructing Wiki's hierarchy, but it covers only a small portion of the findings.
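For readers who want to play with this kind of per-language ranking, here is a minimal sketch. It uses plain PageRank by power iteration as a stand-in, which is NOT the ranking algorithm mentioned above (that one is not described here). The toy link graph, the function names and the parameter values are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [linked pages]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def top_terms(links, k=3):
    """Return the k highest-ranked pages, ties broken alphabetically."""
    rank = pagerank(links)
    return sorted(rank, key=lambda page: (-rank[page], page))[:k]


# Hypothetical toy link graph for one language edition (invented data).
TOY_LINKS = {
    "Russia": ["Moscow", "USA"],
    "Moscow": ["Russia"],
    "Saint Petersburg": ["Russia"],
    "USA": ["Russia"],
    "France": ["Russia", "USA"],
}

best = top_terms(TOY_LINKS, k=1)
```

Running the same function on the link graph of each language edition and comparing the top of the lists is the sort of experiment described above, though a real run would need the full dumps rather than five hand-written pages.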