Finally, after weeks, I have the time and the energy to post complex content. I wrote the essay below for an online course on network analysis. The overall experience was great, as I discovered a world of possibilities for the discipline: any type of network (power, conspiracy, social or knowledge) can be analyzed with the same underlying theory.
Abstract
I propose a network model to map the knowledge and ideas of the people contained in Wikipedia. The methodology for creating the dataset is generic and can be re-applied to any category of Wikipedia. The algorithms used were successful in identifying the clusters and in providing some insights into the dynamics of knowledge. The analysis uses different metrics such as modularity, weighted degree and eccentricity. A small-world test according to the Watts and Strogatz model is performed as well. You can find a printable and zoomable version of the full map here or the high res image here.
Dataset and methodology
I obtained the network information by running a set of queries on DBpedia, a structured repository of the Wikipedia project. The database allows anyone to run complex queries written in SPARQL. The queries are reported below.
SELECT * WHERE {
  ?p a <http://dbpedia.org/ontology/Person> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced .
}

SELECT * WHERE {
  ?p a <http://dbpedia.org/ontology/Person> .
  ?p <http://dbpedia.org/ontology/influencedBy> ?influencedBy .
}
The queries list all the people in DBpedia with non-null values in the field “influenced” or “influencedBy”. After some manipulation in Excel to concatenate the results of the two queries and to fix some Unicode issues, the table looks like the one reported below, listing the name of the influencer (source) and of the influenced (target). The rest of the analysis is performed in Gephi.
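The merge step described above was done in Excel; as a rough sketch of the same idea in Python, the snippet below combines the two result sets into one source/target edge list. It assumes each result row is a pair of DBpedia resource URIs, and the helper names (`clean_name`, `build_edges`) and the sample rows are purely illustrative. Note that rows from the “influencedBy” query have to be flipped so that the influencer always ends up in the source column.

```python
import csv
import io
import unicodedata
from urllib.parse import unquote


def clean_name(uri: str) -> str:
    """Turn a resource URI into a readable, Unicode-normalised name (illustrative helper)."""
    name = unquote(uri.rsplit("/", 1)[-1]).replace("_", " ")
    return unicodedata.normalize("NFC", name).strip()


def build_edges(influenced_rows, influenced_by_rows):
    """Merge both result sets into a sorted, de-duplicated (source, target) list."""
    edges = set()
    for person, other in influenced_rows:      # person influenced other
        edges.add((clean_name(person), clean_name(other)))
    for person, other in influenced_by_rows:   # other influenced person: flip the pair
        edges.add((clean_name(other), clean_name(person)))
    return sorted(edges)


# Hypothetical sample rows, shaped like the output of the two queries.
R = "http://dbpedia.org/resource/"
influenced = [(R + "Jean-Jacques_Rousseau", R + "Immanuel_Kant")]
influenced_by = [
    (R + "Jean-Jacques_Rousseau", R + "Cicero"),
    (R + "Immanuel_Kant", R + "Jean-Jacques_Rousseau"),  # same edge as above, seen from Kant
]
edges = build_edges(influenced, influenced_by)

# Write a CSV with Source/Target headers, the column names Gephi looks for in an edge table.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target"])
writer.writerows(edges)
edge_csv = buffer.getvalue()
```

Using a set removes the duplicates that appear when the same relation is recorded on both infoboxes.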
Nodes represent people (writers, artists, actors, etc.); an edge is created whenever the Wikipedia infobox of one person names another. For example, Jean-Jacques Rousseau influenced Kant and was influenced by Cicero.
The query is designed to be as broad as possible: it is limited only by the number of people in the database recorded as having exercised or experienced some influence over others, and it spans time, space and domain. The analysis is limited by the quality of the data in Wikipedia, which is strongly biased towards Western culture (ref. ATTACHMENT 3 – Geotagged articles in Wikipedia). I will discuss the other limitations further in the text.
Gallery of selected clusters
Analysis
The network is a directed graph with 13 814 nodes and 23 487 edges. Due to limited computational power, I filtered out the nodes with out-degree lower than 2. This yields a significantly more agile graph consisting of 2986 nodes (21%) and 7643 edges (32%).
I used the Force Atlas 2 layout algorithm, followed by a few adjustments made with the Noverlap one. For the first part of the analysis the size of the nodes is proportional to the out-degree, and colors follow the Modularity class.
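The filtering was done in Gephi; a minimal Python sketch of the same kind of filter is shown below. It assumes a single pass (out-degrees are computed once on the full graph and not recomputed after removal), which may or may not match exactly what Gephi does, and the function name and toy edge list are made up for illustration.

```python
from collections import Counter


def filter_by_out_degree(edges, min_out_degree=2):
    """Keep nodes whose out-degree is at least min_out_degree, plus the edges between them."""
    out_degree = Counter(source for source, _ in edges)
    # Nodes that never appear as a source have out-degree 0 and are dropped implicitly.
    keep = {node for node, degree in out_degree.items() if degree >= min_out_degree}
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


# Toy graph: A and B each influence two people, C only one.
sample = [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("C", "A")]
nodes, kept = filter_by_out_degree(sample)
```

In the toy graph only A and B survive, together with the two edges that link them.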
Communities identification – Modularity
Modularity measures are important for identifying communities. The choice of Modularity class as the partitioning criterion seems adequate: writers are mainly represented in light green, philosophers in red, artists in pink and modern writers in blue. The Force Atlas 2 algorithm did a satisfactory job as well. The nodes are in fact arranged in a meaningful manner, with the philosophers at the bottom, the most celebrated writers in the middle and the modern ones at the top, suggesting that ideas have moved along a specific path.
The interface area between philosophers and writers, reported below, is of particular interest: it rightly captures the transition between the two domains, as well-known philosophical authors are correctly placed there.
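To make the idea of modularity a bit more concrete, here is a small sketch that scores a given partition of an undirected graph. It is not the community detection algorithm itself (Gephi finds the partition for you); it only evaluates the standard modularity formula, summing for each community the share of edges inside it minus the share expected from its total degree. The toy graph and labels are invented for the example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition, treating every edge as undirected and unweighted."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of the nodes of each community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge (c-d).
toy = [("a", "b"), ("b", "c"), ("a", "c"),
       ("d", "e"), ("e", "f"), ("d", "f"),
       ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_split = modularity(toy, two_groups)
q_single = modularity(toy, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of contrast a community detection algorithm looks for.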
Who is the most influential? – Weighted degree
At first, the choice of out-degree as the sizing factor for nodes appears justified: the major names are very visible and the influence is well represented. However, some details of the model are not right. For example, the contributions of Confucius, one of the most influential Chinese philosophers, are clearly underestimated; similar considerations apply to Homer, Thales and John Milton. In some cases it might be an issue of the bias towards Western culture or of the data format (ref. ATTACHMENT 1 – Full Graph), as the field mixes names and categories: Plato, for instance, according to the Wikipedia infobox influenced “Most of subsequent western philosophy, including…”. Another reason is that the model adopted is simple and does not contemplate more than one level of influence; therefore the founders of movements tend to be under-represented. For example, Arthur Schopenhauer influenced Friedrich Nietzsche, but Schopenhauer himself was influenced by Giordano Bruno, and the chain can continue.
Averroes => Giordano Bruno => Arthur Schopenhauer => Friedrich Nietzsche
An interesting continuation would be to map the relationship chains more deeply, thus generating an exponentially larger number of edges, and to include more complex influence patterns in the model. The steeply decreasing curve of the out-degree distribution highlights this limitation.
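One possible way to look beyond the first level, sketched here under my own assumptions and not part of the original analysis, is to follow the influence edges transitively and count everyone a person reaches, together with the number of hops. The function name `reach` is hypothetical; the edge list is the chain quoted above.

```python
from collections import defaultdict, deque


def reach(edges, start):
    """Return {person: hops} for everyone reachable from start along influence edges."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    hops = {start: 0}
    queue = deque([start])
    while queue:                      # plain breadth-first search
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[start]                   # a person does not count as influencing themselves
    return hops


chain = [
    ("Averroes", "Giordano Bruno"),
    ("Giordano Bruno", "Arthur Schopenhauer"),
    ("Arthur Schopenhauer", "Friedrich Nietzsche"),
]
averroes_reach = reach(chain, "Averroes")
```

On this chain Averroes has an out-degree of 1 but reaches three people, which is the gap between direct and indirect influence described above.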
More on influence – Eccentricity
Another way of looking at the influence is to use the eccentricity instead of the out-degree. To my knowledge, this measure was not discussed in the lessons.
From Wikipedia: “The eccentricity of a vertex is the greatest geodesic distance between any other vertex. It can be thought of as how far a node is from the node most distant from it in the graph.”
This measure of centrality provides a more accurate interpretation of the influence exerted by a thinker. We can assume that a low-eccentricity node will be connected to a few peers and will therefore have influenced a smaller group. Larger eccentricity might be associated with a longer chain of influence (greater distance from other nodes).
This transformation yields more balanced results regarding the contribution of individuals; however, the information regarding the directionality of the graph is lost, so influencer and influenced are weighted alike. A function combining different measures might provide a viable alternative: for example, we can multiply the out-degree by the eccentricity to capture a certain directionality. Developing this is, however, outside the scope of this essay.
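As a rough illustration of the combined score suggested above, the sketch below computes eccentricity on the undirected view of the graph (so direction is ignored, as noted) and multiplies it by the out-degree. It assumes eccentricity is measured within each node's own connected component, and the function names are hypothetical.

```python
from collections import Counter, defaultdict, deque


def eccentricities(edges):
    """Eccentricity of every node, ignoring edge direction (per connected component)."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    result = {}
    for start in neighbours:
        dist = {start: 0}
        queue = deque([start])
        while queue:                          # breadth-first search from start
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        result[start] = max(dist.values())    # distance to the farthest node
    return result


def influence_scores(edges):
    """Out-degree times eccentricity, one possible way to bring direction back in."""
    out_degree = Counter(source for source, _ in edges)
    return {node: out_degree[node] * ecc for node, ecc in eccentricities(edges).items()}


chain = [
    ("Averroes", "Giordano Bruno"),
    ("Giordano Bruno", "Arthur Schopenhauer"),
    ("Arthur Schopenhauer", "Friedrich Nietzsche"),
]
ecc = eccentricities(chain)
scores = influence_scores(chain)
```

On the chain, Averroes and Nietzsche share the same eccentricity, but only Averroes gets a non-zero score because Nietzsche has no outgoing edges in this toy data.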
Are we talking (and listening) to each other? Small World hypothesis
In order to test the small-world hypothesis according to the Watts and Strogatz model, we need to satisfy two conditions:
The network fails the test, as the average shortest path of 5.53 differs from ln(13814) = 9.53. The implication is that, according to this model, ideas and knowledge are segregated between clusters. This outcome is somewhat confirmed by the historical constraints on the movement of ideas and people (e.g. few German philosophers were studying Confucius or other Chinese thinkers); this is most likely destined to change in the near future.
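The two quantities compared above can be reproduced with a few lines of Python. This is only a sketch: it assumes the average shortest path is taken over all reachable pairs on the undirected view of the graph, which may differ from the setting used in Gephi, and the toy path graph is there just to exercise the function.

```python
import math
from collections import defaultdict, deque


def average_shortest_path(edges):
    """Mean shortest-path length over all reachable ordered pairs, ignoring direction."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    total = pairs = 0
    for start in neighbours:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1            # every reached node except start itself
    return total / pairs


# Reference value used in the text: natural log of the number of nodes.
ln_n = math.log(13814)

# Toy check on a simple path a-b-c-d.
path_graph = [("a", "b"), ("b", "c"), ("c", "d")]
toy_average = average_shortest_path(path_graph)
```

The value of `ln_n` rounds to the 9.53 quoted in the text; the average shortest path would then be computed on the real edge list in place of the toy one.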
Interpretation, conclusions and further developments
The analysis showed some interesting insights regarding knowledge and how ideas are propagated; however, it is negatively affected by some current limitations of Wikipedia, specifically the Western bias and the other issues discussed in the essay.
Further development would need to address the lack of fit for the influence through three improvements: first, a more realistic chain of connections, not limited to only one level; second, the definition of an improved ranking function; third, the enrichment of the dataset with less biased information (maybe complementing the current one with other data on citations).
References
Blog article on Drunks&Lampposts by Simon Raper, available at: http://drunks-and-lampposts.com/2012/06/13/graphing-the-history-of-philosophy/
Easley & Kleinberg, Networks, Crowds and Markets
Lectures and slides from the Coursera course Social Network Analysis by Lada Adamic.
Ulrik Brandes, A Faster Algorithm for Betweenness Centrality, in Journal of Mathematical Sociology 25(2):163-177, (2001)
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008
Wikipedia article on Eccentricity available at http://en.wikipedia.org/wiki/Eccentricity_(graph_theory)
Very nice article, Paolo. I’m wondering after you retrieved the database, what software did you use to visualize the graph in large scale? I really wish MATLAB has such visualization tools so that I can visualize my functional brain network. Thank you very much, Paolo. Keep up your good work! –Bot
Thanks Bot, I used a software called Gephi (https://gephi.org/), it’s a great (and free) network visualization software. I am pretty sure we can import your network from Matlab into Gephi and do some network analysis. Drop me a line if interested.
Ciao,
Paolo
Thanks for your quick response, Paolo! I’m still working on it, and once I figure out a good brain network to visualize I will let you know.
How to interpret the database query that you did?
Hi Azad, the queries are written in SPARQL, a query language used to mine data out of datasets stored in the Resource Description Framework, like Wikipedia and other big data systems. I am not an expert but you can check how it is defined here (http://www.w3.org/TR/rdf-sparql-query/) and here (http://wiki.dbpedia.org/Datasets) and some examples here (https://wiki.base22.com/display/btg/SPARQL+Query+Examples)
Thanks Paolo!!
Great job !!! congratulations…
Just desire to say your article is as surprising. The clarity in your submit is simply great and i could think you are knowledgeable on this subject. Fine along with your permission let me to grab your RSS feed to stay updated with coming near post. Thank you one million and please continue the enjoyable work.
Hi Jodi, Thanks for the kind words. I like to dig on some issues and post the results here (when I have time). You’re most welcome to subscribe to the RSS feed.
Hey Paolo, how did you clean excel files to include names and how did you merge them into one. I would love to know the process. Thank you
Do you have more great articles like this one?