Graphing a Lesson Learned Database
Ask any project manager and they will tell you the importance of reviewing lessons learned prior to starting a new project. The lesson learned databases are filled with nuggets of valuable information to help project teams increase the likelihood of project success. Why then do most lesson learned databases go unused by project teams? In my experience they are difficult to search through and require hours of time to review the result set. Recently I had a project engineer ask me if we could search our lessons learned using a list of 22 key terms the team was interested in. Our current keyword search engine would require him to enter each term individually, select the link, and save the document for review. By the way, there was no way to search only the database, the query would search our entire corpus, close to 20 million urls. This would not do. I asked our search team if they would run a special query against the lesson database only, using the terms provided. They returned a spreadsheet with a link to each document containing the term or terms. Over 1100 documents were on the list. The engineer had his work cut out for him.
I started thinking there had to be a better way. I had been experimenting with topic modeling, in particular to assist our users in connecting seemingly disparate documents through an easier visualization mechanism. Something better than a list of links on multiple pages. I gathered my toolbox, R/RStudio
, for the topic modeling and exploring the data, Neo4j
, for modeling and visualizing the topics, and Linkurious
, a web front end for our users to search and visualize the graph data base.
Build the Topic Model
All the code and data can be found at my Github repository. In this article, I will focus on the R
and Cypher
code to work with Neo4j. I will demonstrate the topic modeling code in another article.
For this demonstration, I exported over 2000 lessons from our NASA Engineering Network lesson learn public database. The data was fairly clean and contained some useful metadata.
The steps for creating the topic model are:
- Create the corpus.
- Create the document term matrix to use in the model
- Determine the optimal number of topics for the model.
- In topic modeling the k topics needs to be know. I currently use a harmonic mean approach. For this data, 35 topics was determined to be optimal
- Run the model
Since documents can be in more than one topic, I extract theta, the per-document probabilities, from the model and assign each document its most diagnostic topic. In turn, I extracted the top thirty terms for each topic, in rank order, created a label for each topic based on the top three terms, and correlated the topics based on the Category metadata associated with each document. All of this information would help develop my graph model.
I am jumping ahead here, however I wanted to show how you can visualize your graph database in R using the RNeo4j
and visNetwork
package, see Nicole’s blog, Visualize Your Graph with RNeo4j and visNetwork. Assume I have already created the database. I actually created the database and queries using RNeo4j, but that is for another time. The first query I ran returned the term “contamin”
and the topics the term was in. The edge weight was determined by the rank of the term in the topic. The thicker the edge the higher the term ranked in the topic. The next query returned all the lessons for Topic 27
. Finally, I visualized the model in R using the LDAvis
package. I could see the top thirty terms for each topic and how those terms were distributed among the other topics. Helpful to me, but not what I need for the end user. I saved all the lesson information in csv files to import into Neo4j.
Building the Graph Database
If you are just beginning to work with graph databases and Neo4j, you need to read Nicole White’s blog. The next section is based on her webinar Using Load CSV) in the Real World.
Having been born in the eight track era, I decided to first model my graph on a white board. I drew a rough outline showing the various connections of the proposed nodes in the graph. It is fairly simple, however I found it helps keep me focused on the model.
To create the nodes, I imported the data I created above using LOAD CSV
in cypher. The first section of code creates a unique constraint for the lesson node, preventing any duplications of lesson ids. Reading the csv file, a lesson node is created and properties are set. The properties were extracted from the metadata associated with each lesson. As Nicole suggested in her webinar, I split the date up into three parts. I will use the year property later to assign a weight to an edge. The newer the lesson the larger the edge will be. Each lesson node will also contain as a property, the abstract, if available, the lesson itself, the directorate the lesson originated from, whether it was a safety issue, and a link to the lesson on the NEN website.
// Nodes created for Lessons, Submitter, Center and Topic
// Relations created
// Uniqueness constraints.
CREATE CONSTRAINT ON (l:Lesson) ASSERT l.name IS UNIQUE;
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:e:/Users/David/Dropbox/doctopics/data/llis.csv' AS line
WITH line, SPLIT(line.LessonDate, '-') AS date
CREATE (lesson:Lesson { name: TOINT(line.`LessonId`) } )
SET lesson.year = TOINT(date[0]),
lesson.month = TOINT(date[1]),
lesson.day = TOINT(date[2]),
lesson.title = (line.Title),
lesson.abstract = (line.Abstract),
lesson.lesson = (line.Lesson),
lesson.org = (line.MissionDirectorate),
lesson.safety = (line.SafetyIssue),
lesson.url = (line.url)
Many of the nodes are associated with multiple lessons in the csv file. The MERGE
command will create the node if it does not exist or match it if it does.
// Merges multiple entries of node in csv file
MERGE (submitter:Submitter { name: UPPER(line.Submitter1) })
MERGE (center:Center { name: UPPER(line.Organization) })
MERGE (topic:Topic { name: TOINT(line.Topic) })
MERGE (category:Category { name: UPPER(line.Category) })
Once the nodes are completed I create the relationships between the nodes. A lesson is contained in a topic, was written by a submitter, occurred at a NASA Center and fell into a particular category.
CREATE (topic)-[:Contains]->(lesson)
CREATE (submitter)-[:Wrote]->(lesson)
CREATE (lesson)-[:OccurredAt]->(center)
CREATE (lesson)-[:InCategory]->(category)
;
Here is the finished result showing two lesson nodes and their relationships.
In the R code, I calculated the most representative Category for each topic and saved it into a csv file, which I now load into the graph database. Since the nodes were already previously created, I use the MATCH
function to get the node then create the relationship of the topic to the category.
// Topic, category.
// Adds the category nodeand creates a relations to the topic
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:e:/Users/David/Dropbox/doctopics/data/topicCategory.csv' AS line
MATCH (topic:Topic { name: TOINT(line.Topic) })
MATCH (category:Category { name: UPPER(line.Category) })
CREATE (topic)-[:AssociatedTo]->(category)
;
Having correlated the Topics, I can add that relationship to the nodes. Since I am using an undirected relationship I discovered I needed to use MERGE
to create the edge. I can use this property later to query the database to find other Topics which might contain lessons that have a connection to or happen in conjunction with lesson I am interested in.
// Topic, Correlation.
// Adds a relation to each topic using their correlation as a property of the relationship
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:e:/Users/David/Dropbox/doctopics/data/topicCorr.csv' AS line
MATCH (topic:Topic), (topic2:Topic)
WHERE topic.name = TOINT(line.Topic) AND topic2.name = TOINT(line.ToTopic)
MERGE (topic)-[c:CorrelatedTo {corr : TOFLOAT(line.Correlation)}]-(topic2)
// MERGE is used because it is non directed relationship
;
The following cypher code, creates the nodes for the top thirty terms in each topic, creates the edge between the term node and the topic node, setting the rank relationship property to the terms rank in the topic and lastly sets the label property in the topic node the label generated in the R code.
// Topic, Terms.
// Creates term nodes and relationship to topic by the rank the
// term is in the topic. A rank of 1 means that term is the most
// frequent in that topic
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:e:/Users/David/Dropbox/doctopics/data/topicTerms.csv' AS line
MATCH (topic:Topic { name: TOINT(line.Topic) })
MERGE (term:Term { name: UPPER(line.Terms) })
CREATE (term)-[r:RankIn {rank : TOINT(line.Rank)}]->(topic)
;
// Topic, Labels.
// Creates label porperty for each topic by using the top 3 ranked terms is in the topic. A rank of // 1 means that term is the most frequent in that topic
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:e:/Users/David/Dropbox/doctopics/data/topicLabels.csv' AS line
MATCH (topic:Topic { name: TOINT(line.Topic) })
SET topic.label = line.Label
;
To show you how to extract some information form the graph database in neo4j, here are some sample queries and their results. Both the graph and row version is displayed. This query returns the lessons in Topic 27
.
MATCH (n:Topic)-[r:Contains]->(m:Lesson)
WHERE n.name = 27
RETURN n, m,
n.name AS from,
m.id AS to,
m.title AS title,
(m.year) AS value,
"In" AS label
Topics correlated to Topic 27
with a correlation greater than 0.02
MATCH (n:Topic)-[r:CorrelatedTo]->(m:Topic)
WHERE n.name = 27 AND (r.corr > 0.02)
RETURN n, m,
n.name AS from,
m.id AS to,
m.title AS title,
(m.year) AS value,
"In" AS label
Topics with correlation greater than 0.40. I am still working on this as it also shows all the correlations for each topic.
MATCH (n:Topic)-[r:CorrelatedTo]->(m:Topic)
WHERE r.corr > 0.40
RETURN n, m,
n.name AS from,
m.name AS to,
(r.corr) AS value
All topics correlated to Topic 27
. The edge label is the numeric correlation value.
MATCH (n:Topic)-[r:CorrelatedTo]->(m:Topic)
WHERE n.name = 27
RETURN n, m,
n.name AS from,
m.name AS to,
(r.corr) AS value
User Visualization
Now I have all the data in my graph database, yet I still need to make it easier for my end users to search and connect lessons based on their criteria. In Learning Neo4j”, several visualization options were mentioned. I evaluated several of the applications and for this demonstration I settled on Linkurious, a web based interface for users to search and visualize graph data. Linkurious was designed to connect to a neo4j database and requires the neo4j database to be running before you start it. Upon opening the application you are presented with this dashboard.
Since Linkurious comes with a built in instance of Elastic Search, you can search through all the nodes and edges if you desire.
The real benefit comes when you create your first visualization. Let us assume I am looking for lessons that may contain the terms, fuel
, water
, valve
or failure
. I am shown the Lesson node with lessons containing some of the words, the Term
node for two of the terms, and a Topic
node that contains three of the terms as its label. Remember, the label was created using the top three terms for each topic. Therefore, it is safe to assume the terms appear frequently in lessons contained in Topic 2
and the lessons pertain to fuel valve and/or water valve issues.
By clicking on Topic Two the node appears on my canvass.
I can begin exploring the topic and uncover relationships. Since the topic is related to four other nodes, I am given an option to select the nodes I want to display on the screen. Selecting the Lesson node, I am able to display all of the lessons contained in this topic. You can make adjustments to the visualization to add color and size. In this case, each node type is a different color and the node size is determined by the year the lesson was written. The newer the lesson the larger the node. Properties for each node are displayed to the left. In the image below, information on the highlighted node can be seen and if the user wants to visit the site where the lesson is stored, they can click on the url in the properties.
I can continue to explore my data. By clicking on the highlighted lesson node, its edges are displayed, giving me the author of the lesson, the Center
it occurred at and the Category
it is associated to.
Returning to the topic node, I can click on it again, this time selecting to display the Categories associated to it. I can then explore the Category node to see lesson associated with Risk Management
, Energy
, Power
or Ground Support Systems
, allowing me to find other lessons that are closely aligned to what I am looking for. You cannot see these connections in a keyword search list.
I have just begun to scratch the surface and I am still learning. There are many more examples I could show, but I will leave some of the exploring to you. I believe Neo4j, R/RStudio and Linkurious will allow me to explore and visualize my data in ways our current search engine cannot do. While this is still a work in progress, I believe using a graph database in this manner can provide users with a more effective search experience reducing their time to find answers and allowing them to start their project on the right foot. The combination of these tools, R/RStudio, Neo4j and Linkurious give analyst the ability to create excellent analytical and visual representations of large document repositories. Feel free to reach out at to me on Twitter, @davidmeza1, if you have any questions or suggestions.