Network map of Knowledge and Art

Finally, after weeks, I have the time and the energy to post complex content. I wrote the essay below for an online course on network analysis. The overall experience was great, as I discovered a world of possibilities for the discipline: any type of connection, whether a power, conspiracy, social or knowledge network, can be analyzed with the same underlying theory.

Abstract

I propose a network model to map the knowledge and ideas of the people contained in Wikipedia. The methodology used to create the dataset is generic and can be re-applied to any category of Wikipedia. The algorithms used were successful in identifying the clusters and in providing some insights into the dynamics of knowledge. The analysis uses different metrics such as modularity, weighted degree and eccentricity. A small-world test according to the Watts and Strogatz model is performed as well. You can find a printable and zoomable version of the full map here or the high res image here.

Dataset and methodology

I obtained the network information by running a set of queries on dbpedia, a structured repository of the Wikipedia project. The database allows anyone to run complex queries written in the SPARQL query language. The two queries are reported below.

SELECT *
WHERE {
  ?p a <http://dbpedia.org/ontology/Person> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced .
}

SELECT *
WHERE {
  ?p a <http://dbpedia.org/ontology/Person> .
  ?p <http://dbpedia.org/ontology/influencedBy> ?influencedBy .
}

The queries list all the people in dbpedia with non-null values in the field “influenced” or “influencedBy”. After some manipulation in Excel to concatenate the results of the two queries and to fix some Unicode issues, the result is a table like the one below, listing the name of the influencer (source) and of the influenced (target). The rest of the analysis is performed in Gephi.

Influencer and influenced in Wikipedia People category
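The concatenation step was done by hand in Excel, so the following is only a minimal Python sketch of the same idea, not the procedure actually used. It assumes each query result is a list of (person, other) URI pairs; the function names `short_name` and `merge_influence_rows` and the sample rows are illustrative. The one point worth noting is that rows from the “influencedBy” query have to be swapped so that the influencer always ends up in the source column.

```python
from urllib.parse import unquote


def short_name(uri):
    """Turn a dbpedia resource URI into a readable name."""
    # Keep the last path segment, decode %-escapes, replace underscores.
    return unquote(uri.rsplit("/", 1)[-1]).replace("_", " ")


def merge_influence_rows(influenced_rows, influenced_by_rows):
    """Combine both query results into one sorted (source, target) edge list."""
    edges = set()  # a set drops relationships reported by both queries
    # Query 1: ?p influenced ?x, so p is the influencer (source).
    for person, influenced in influenced_rows:
        edges.add((short_name(person), short_name(influenced)))
    # Query 2: ?p influencedBy ?x, so x is the influencer: swap the columns.
    for person, influenced_by in influenced_by_rows:
        edges.add((short_name(influenced_by), short_name(person)))
    return sorted(edges)


# Illustrative rows only, built from the Rousseau example in the text.
BASE = "http://dbpedia.org/resource/"
influenced_rows = [(BASE + "Jean-Jacques_Rousseau", BASE + "Immanuel_Kant")]
influenced_by_rows = [
    (BASE + "Jean-Jacques_Rousseau", BASE + "Cicero"),
    (BASE + "Immanuel_Kant", BASE + "Jean-Jacques_Rousseau"),
]
edges = merge_influence_rows(influenced_rows, influenced_by_rows)
for source, target in edges:
    print(source, "->", target)
```

Here the Rousseau to Kant relationship appears in both inputs and is kept only once.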

Nodes represent people (writers, artists, actors, etc.); an edge is created whenever the Wikipedia infobox of one person contains the name of another. For example, Jean-Jacques Rousseau influenced Kant and was influenced by Cicero.

Example of how relationships are constructed in my simple model.

The query is designed to be as broad as possible: it spans time, space and domain, and is limited only by the number of records in the database for people who exercised or experienced some influence. The analysis is limited by the quality of the data in Wikipedia, which is strongly biased towards Western culture (ref. ATTACHMENT 3 – Geotagged articles in Wikipedia). I discuss the other limitations further in the text.

Gallery of selected clusters

Actors

Contemporary painters and artists

Golden Age Latin – Classical Authors

Contemporary Authors #1

Contemporary Authors #2

Contemporary sculptors

Different node sizes according to eccentricity

Analysis

The network is a directed graph with 13 814 nodes and 23 487 edges. Due to limited computational power, I filtered out the nodes with out-degree lower than 2. This yields a significantly more agile graph consisting of 2986 nodes (21%) and 7643 edges (32%).
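The filtering itself was done in Gephi. As a rough illustration of what an out-degree filter does, here is a small Python sketch; it assumes the filter is applied once (not repeated until stable) and that an edge survives only if both of its endpoints survive, which may differ from Gephi's exact behaviour. The function name `filter_by_out_degree` and the toy edge list are made up for the example.

```python
from collections import Counter


def filter_by_out_degree(edges, min_out_degree=2):
    """Keep nodes with out-degree >= min_out_degree and the edges among them."""
    # Out-degree = number of edges leaving a node (people it influenced).
    out_degree = Counter(source for source, _ in edges)
    kept_nodes = {node for node, deg in out_degree.items() if deg >= min_out_degree}
    # Assumption: an edge is kept only when both endpoints are kept.
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


# Toy graph: A and B each influence two people, C only one.
toy_edges = [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("C", "A")]
kept_nodes, kept_edges = filter_by_out_degree(toy_edges)
print(sorted(kept_nodes), kept_edges)
```

Note that people who were only ever influenced (out-degree 0) never appear as a source and are dropped as well.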

I used the Force Atlas 2 layout algorithm, followed by a few adjustments with the Noverlap algorithm. For the first part of the analysis, the size of the nodes is proportional to the out-degree and the colors follow the Modularity class.

Communities identification – Modularity

Modularity measures are important to identify communities. The choice of Modularity class as the partitioning criterion seems adequate: writers are mainly represented in light green, philosophers in red, artists in pink and modern writers in blue. The Force Atlas 2 algorithm did a satisfactory job as well. The nodes are in fact arranged in a meaningful manner, with the philosophers at the bottom, the most celebrated writers in the middle and the modern ones at the top, suggesting that ideas moved along a specific path.
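To give a feeling for what the modularity score rewards, the sketch below computes it for a given partition of a tiny graph. It is a simplification under stated assumptions: edges are treated as undirected and unweighted, and the code only scores a partition that is handed to it, whereas Gephi searches for a good partition. The function name `modularity` and the toy graph are illustrative.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition, treating edges as undirected and unweighted.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2
    where m is the number of edges, L_c the number of edges inside c
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c
    degree_sum = defaultdict(int)  # d_c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge (c - d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_edges, two_groups)   # 5/14, about 0.357
q_single = modularity(toy_edges, one_group)   # 0.0
print(round(q_split, 3), q_single)
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the behaviour that makes the colored classes on the map line up with recognisable groups.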

The interface area between philosophers and writers, reported below, is of particular interest: it rightly captures the transition between the two domains, as well-known philosophical authors are correctly placed there.

Who is the most influential? – Weighted degree

At first, the choice of out-degree as the sizing factor for nodes appears justified: the major names are very visible and the influence is well represented. However, some details of the model are not right. For example, the contributions of Confucius, one of the most influential Chinese philosophers, are clearly under-estimated; similar considerations apply, for example, to Homer, Thales and John Milton. In some cases it might be an issue of the bias towards Western culture or of the data format (ref. ATTACHMENT 1 – Full Graph), as the field mixes names and categories: Plato, for instance, according to the Wikipedia infobox influenced “Most of subsequent western philosophy, including…”. Another reason is that the model adopted is simple and does not contemplate more than one level of influence; therefore the founders of movements tend to be under-represented. For example, Arthur Schopenhauer influenced Friedrich Nietzsche, but Schopenhauer himself was influenced by Giordano Bruno, and the chain can continue.

Averroes => Giordano Bruno => Arthur Schopenhauer => Friedrich Nietzsche

An interesting continuation would be to map the relationship chains more deeply, thus generating an exponentially bigger number of edges, and to include more complex influence patterns in the model. The steeply decreasing curve of the out-degree distribution highlights this limitation.

The influence of some nodes is under-estimated
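One possible way to look beyond a single level of influence is to count everyone reachable along the chain rather than only direct successors. The sketch below does this with a breadth-first search; it was not part of the analysis, and the function name `influence_reach` and the `max_depth` parameter are hypothetical. The only data used is the four-name chain quoted above.

```python
from collections import defaultdict, deque


def influence_reach(edges, start, max_depth=None):
    """Return {person: number of steps} for everyone reachable from start."""
    successors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Stop expanding once the optional depth limit is reached.
        if max_depth is not None and depth[node] >= max_depth:
            continue
        for nxt in successors[node]:
            if nxt not in depth:  # first visit is the shortest chain
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    del depth[start]  # a person does not count as influencing themselves
    return depth


chain = [
    ("Averroes", "Giordano Bruno"),
    ("Giordano Bruno", "Arthur Schopenhauer"),
    ("Arthur Schopenhauer", "Friedrich Nietzsche"),
]
direct = influence_reach(chain, "Averroes", max_depth=1)
full = influence_reach(chain, "Averroes")
print(len(direct), len(full))
```

With one level only, Averroes has a single successor; following the chain to the end, he reaches three people, which is the kind of indirect influence the out-degree sizing misses.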

Are we talking (and listening) to each other? Small World hypothesis

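In order to test the small-world hypothesis according to the Watts and Strogatz model, the network needs to satisfy two conditions: a high clustering coefficient, and an average shortest path comparable to that of a random graph of the same size, taken here as ln(N) for N nodes.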

The network fails the test, as the average shortest path of 5.53 differs from ln(13814) = 9.53. The implication is that, according to this model, ideas and knowledge are segregated between clusters. This outcome is somewhat confirmed by the historical constraints on the movement of ideas and people (e.g. few German philosophers studied Confucius or other Chinese thinkers); this is most likely destined to change in the near future.
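The 5.53 figure comes from Gephi. For readers who want to see how such a comparison can be reproduced, here is a minimal sketch that computes the average shortest path of a directed graph with breadth-first search and the ln(N) reference value. It assumes that only pairs connected by a directed path enter the average; the function name `average_shortest_path` and the toy cycle are illustrative.

```python
import math
from collections import defaultdict, deque


def average_shortest_path(edges):
    """Mean shortest-path length over ordered pairs joined by a directed path."""
    successors = defaultdict(set)
    nodes = set()
    for source, target in edges:
        successors[source].add(target)
        nodes.update((source, target))

    total, pairs = 0, 0
    for start in nodes:
        # Breadth-first search gives shortest hop counts from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in successors.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the start node itself
    return total / pairs if pairs else float("nan")


# Toy directed cycle A -> B -> C -> A: every node reaches the others in 1 or 2 hops.
toy_cycle = [("A", "B"), ("B", "C"), ("C", "A")]
toy_average = average_shortest_path(toy_cycle)

# Reference value used in the text for the full network of 13 814 nodes.
LN_N = math.log(13814)
print(toy_average, round(LN_N, 2))
```

On the full edge list, the first number would be compared against the second in the same way the text compares 5.53 with 9.53.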

Interpretation, conclusions and further developments

The analysis provided some interesting insights into knowledge and how ideas propagate; however, it is negatively affected by some current limitations of Wikipedia, specifically the Western bias and the other issues discussed in the essay.
Further development would need to address the poor fit of the influence measure through three improvements: first, a more realistic chain of connections, not limited to only one level; second, the definition of an improved ranking function; third, the enrichment of the dataset with less biased information (maybe complementing the current one with other data on citations).

References

Simon Raper, blog article on Drunks&Lampposts, available at: http://drunks-and-lampposts.com/2012/06/13/graphing-the-history-of-philosophy/

Easley & Kleinberg, Networks, Crowds and Markets

Lada Adamic, Social Network Analysis, lectures and slides, Coursera.

Ulrik Brandes, A Faster Algorithm for Betweenness Centrality, in Journal of Mathematical Sociology 25(2):163-177, (2001)

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008

Wikipedia article on Eccentricity available at http://en.wikipedia.org/wiki/Eccentricity_(graph_theory)

15 thoughts on “Network map of Knowledge and Art”

Bot's Blog 24 November, 2012 at 11:06

Very nice article, Paolo. I’m wondering after you retrieved the database, what software did you use to visualize the graph in large scale? I really wish MATLAB has such visualization tools so that I can visualize my functional brain network. Thank you very much, Paolo. Keep up your good work! –Bot

1. Paolo Negrini Post author 24 November, 2012 at 12:39
  
  Thanks Bot, I used a program called Gephi (https://gephi.org/); it’s a great – and free – network visualization tool. I am pretty sure we can import your network from Matlab into Gephi and do some network analysis. Drop me a line if interested.
  Ciao,
  Paolo
  
  1. Bot's Blog 24 November, 2012 at 21:31
    
    Thanks for your quick response, Paolo! I’m still working on it, and once I figure out a good brain network to visualize I will let you know.
Azad 25 November, 2012 at 07:56

How to interpret the database query that you did?

1. Paolo Negrini Post author 25 November, 2012 at 12:39
  
  Hi Azad, the queries are written in SPARQL; it’s a query language used to mine data out of datasets stored in the Resource Description Framework (RDF), like Wikipedia and other big data systems. I am not an expert, but you can check how it is defined here (http://www.w3.org/TR/rdf-sparql-query/) and here (http://wiki.dbpedia.org/Datasets), and find some examples here (https://wiki.base22.com/display/btg/SPARQL+Query+Examples).
  
  1. Azad 25 November, 2012 at 18:25
    
    Thanks Paolo!!
abusettilvaro Busetti 3 December, 2012 at 12:15

Great job !!! congratulations…

Jodi 7 July, 2013 at 12:24

Just desire to say your article is as surprising. The clarity in your submit is simply great and i could think you are knowledgeable on this subject. Fine along with your permission let me to grab your RSS feed to stay updated with coming near near post. Thank you one million and please continue the enjoyable work.

1. Paolo Negrini Post author 8 July, 2013 at 10:16
  
  Hi Jodi, thanks for the kind words. I like to dig into some issues and post the results here (when I have time). You’re most welcome to subscribe to the RSS feed.
  
Ali Gajani 20 December, 2013 at 19:49

Hey Paolo, how did you clean the Excel files to include names, and how did you merge them into one? I would love to know the process. Thank you

Datherine 7 January, 2015 at 07:40

Do you have more great articles like this one?
