Login

Data Mining with Clojure and Datomic

In this article we take a look at a few items:

  • Applying a "Functional Style" to our code. Making better use of Clojure's programming features.
  • Introduction to Datomic, a database written in Clojure.
  • Use of Clojure Test fixtures.

For our project, we are going to analyze job skills. We use Stack Overflow Careers 2.0 as our data source. For example, I query Stack Overflow Careers 2.0 for jobs within a 10 mile radius of my zip code. I search for job postings that contain the keyword "java" or "clojure". I capture the search results as an RSS XML document. I save the XML document to disk.

We use Datomic to persist our data. We also use Datomic to query our data. We then create reports from the queried data.

Datomic is not a relational database. However, it's pretty simple to define a Datomic schema which includes logical relationships.

For our project we first establish a "snapshot". A "snapshot" simply simply tells us when we collected our data. A "snapshot" includes a short description. Here is the "snapshot" schema definition supplied to Datomic:

{:db/id #db/id[:db.part/db]
  :db/ident :snapshot/time
  :db/valueType :db.type/instant
  :db/cardinality :db.cardinality/one
  :db/doc "time data was extracted. milliseconds"
  :db.install/_attribute :db.part/db}

  {:db/id #db/id[:db.part/db]
  :db/ident :snapshot/description
  :db/valueType :db.type/string
  :db/cardinality :db.cardinality/one
  :db/unique :db.unique/identity
  :db/doc "free form description of snapshot"
  :db.install/_attribute :db.part/db}

  {:db/id #db/id[:db.part/db]
  :db/ident :snapshot/job-set
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/many
  :db/doc "List of Jobs obtained during snapshot"
  :db.install/_attribute :db.part/db}
  

Note the last attribute ":snapshot/job-set". That states a "snapshot" contains zero to many "jobs". The cardinality is "many", the datatype is "ref". We are storing references to "jobs". This definition is similar to a relational database "foreign key".

We use the same concept to relate a "job" to "skills". Each "job" has an attribute which stores zero to many "skill" references. A skill being something like "java", "sql", "python", etc.

The complete schema is located on github in schema.dtm.

The process for updating the database is simple. I first process the "skills". I query Datomic to determine whether a skill already exists in our database. If the skill does not exist in our database, we want Datomic to generate a unique identifier, AKA entity id. If the skill already exists, we use the existing entity id.

Here is the relevant code.

;; conn parameter is the database connection
;; queries the database for a particular skill (E.G. "programming")
;; if found, returns the database entity id
(defn get-skill-entity-id [conn skill]
 (let [results (q '[:find ?c :in $ ?t :where [?c :skill-set/skill ?t]]
              (db conn) skill) ]
         (first results)))

;; returns true if the skill can not be found
;; in the database (E.G. do we have "programming" stored in the database)
(defn skill-not-exists? [conn skill]
   (nil? (get-skill-entity-id conn skill)))

(defn process-skill [conn skill]
  (when (skill-not-exists? conn skill)
    (add-skill conn skill)))

Now, let's add the skill.

(defn add-skill [conn skill]
   (let [ temp_id (d/tempid :db.part/user) ]
      @(d/transact conn
      [[:db/add temp_id :skill-set/skill skill]])))

Now that the skills are processed, we can now process jobs. Remember, when we process a job we need to associate zero to many skills. So, we now have entity ids for all the skills. Thus, we can use the skill's entity ids as references for our jobs. Here is a portion of our code which stores a job.

(defmulti add-job (fn [one two three four five] (class five)))

;; job has skill list
(defmethod add-job clojure.lang.PersistentVector
  [conn entity-id title job-key skill-list]
  @(d/transact conn [{:db/id entity-id
                      :jobs/title title,
                      :jobs/job-key job-key,
                      :jobs/skill-set skill-list}]))

;; job does not have any skills 
(defmethod add-job nil
  [conn entity-id title job-key skill-list]
  @(d/transact conn [{:db/id entity-id
                      :jobs/title title,
                      :jobs/job-key job-key}]))

We define 2 method signatures, 1) a job with skills, 2) a job without any skills.

We use the same logic that we used for skills to establish an entity id. We look up the job, if the job is already present in the database we use the existing entity id. If the job does not exist, then let Datomic create a new entity id.

Finally we can add our snapshot to the database. The same logic applies. The snapshot contains zero to many job references. Now that we have an entity id for each job. We use the job entity ids are references in our snapshot.

Storing Functions in Datomic
Like other databases, you can store "functions" in Datomic. Datomic currently supports either Java or Clojure functions. For our project, I stored a "calc-percent" function. The "calc-pecent" function serves as a "proof-of-concept".

I defined the "calc-percent" function is an external file, along with the other schema statements. The file is called functions.dtm.

[{:db/id #db/id [:db.part/user]
    :db/ident :calc-percent
    :db/fn #db/fn {:lang "clojure" :params [skill tot freq]
    :code [(format "%s %d %.4f" skill freq (* (float (/ freq tot)) 100) )]}}]

Note, the identifier clause ":db/ident :calc-percent". We can use the stored function like this:

;; conn is the database connection
;; db is function that creates are reference to the current database
;; invoke is a datomic function which calls our stored function
(d/invoke (db conn) :calc-percent "java" 20 5)

The result looks like this:

java 5 25.0000 

If we enhance our application, and create a more complex calculation. Storing the function in the database becomes very useful. We enforce integrity on the function, just like we do with the data. A nice feature, really easy to use.

Using Datomic to Query Data
Querying data is fairly simple. The query statements are nuanced, so let's look at an example. Here we want to get the details of a single snapshot. That means we get a list of jobs. For each job we'll get the title, the job key (a unique identifier obtained from the data source, Careers 2.0), a vector of skills (E.G. ["java" "python" "sql"]).

;; conn paramenter is a database connection.
;; description is a short narrative we used to name our snapshot. E.G. "June 1".
(defn get-snapshot-details [conn description]
   (q '[:find ?title ?job-key (vec ?skill-val)
               :in $ ?desc
               :where [?t :snapshot/description ?desc]
               [?t :snapshot/job-set ?job-set]
               [?job-set :jobs/title ?title]
               [?job-set :jobs/job-key ?job-key]
               [?job-set :jobs/skill-set ?skill-ref]
               [?skill-ref :skill-set/skill ?skill-val]] (db conn) description))

The top line of the query describes our return values (?title, ?job-key and (vec ?skill-val) ). Note, we can use simple Clojure expressions like "vec". A more complex Clojure expression like (map #(... (vec..)) is not currently supported.

The next line ":in" describes the query parameters. In this case we are querying with a snapshot description. A free-form text summary we used to name the snapshot. The parameter value is stored in ?desc. As an example, let's assume we stored a snapshot called "My First Snapshot".

The ":where" clause contains the nuances.

[?t :snapshot/description ?desc]

":snapshot/description" is our database attribute. If we add a incoming parameter to the right of the attribute we are creating a logical "where". So in this case we are stating "find the snapshot who's value is 'My First Snapshot' ". To the left of the attribute is ?t. ?t contains the internal Datomic identifier we use to join the rest of the attributes.

Next attribute in the query is "[?t :snapshot/job-set ?job-set]". Note! We used ?t to join snapshot description and job-set attributes. This time to the right of the attribute we are introducing a new symbol ?job-set. Because ?job-set is a new symbol (not defined anywhere else in the query), Datomic automatically uses the symbol to store the entity id of the snapshot, job-set attribute.

Next attribute [?job-set :jobs/title ?title]. We are joining the snapshot job references to the individual jobs. Again the statement is nuanced. The symbol ?title is defined in the queries :find clause. Thus, Datomic retrieves the job title value and stores it in ?title.

The same logic applies to the rest of the query. We join the job to it's related skills. We store the skill values in ?skill-val. Again, ?skill-val is defined in the :find clause. Thus Datomic uses ?skill-val for capturing output.

My advice is to practice in an interactive Clojure REPL. Once you get a basic query going, the REPL is a really great environment for exploring more of Datomic's features.

Processing XML. A more functional approach.
Here I wanted to show how I gradually changed my code as I learn and garner more experience with Clojure.

As I stated above, our data source is a RSS XML document. Skills are stored in

<category>css</category><category>javascript</category><category>html</category>"

When we process the XML document we want to store the relevant data in a simple Clojure "record". A Clojure record being a type of map structure. A record is series of key/value pairs. Our job data is stored in the following record:

(defrecord Node [title pubDate link job-key categories ])

So a job, stored in a "Node" record, looks like:

{ :title "Senior Programmer" :pubDate "2014-01-01" :job-key 56783 :categories ["java" "python" "sql"] }

I am using two Clojure provided functions to parse the XML document. "xml/parse[]" is provided in the clojure.xml package. "xml-seq[]" is provided in the clojure.core package. "xml/parse[]" uses a SAX parser to read the XML document. "xml-seq[]" stores the parsed data into a sequence of tree structures.

Here is an example of what "xml-seq[]" creates:

(:guid [\"http://test-site.com/jobs/555666/entry-level-developer-test4\"]
:link [\"http://test-site.com/jobs/555666/entry-level-developer-test4\"]
:category [\"css\"] :category [\"javascript\"] :category [\"html\"] :title [\"Four\"]
:description [\"Senior Programmer\"] :pubDate [\"Thu, 04 Dec 2013 14:44:52 Z\"]
:a10:updated [\"2013-12-05T12:44:52Z\"])

The challenge is to convert

category [\"css\"] :category [\"javascript\"] :category [\"html\"] 

to:

 :categories ["css" "javascript" "html"]

A Clojure map's key must be unique. Thus, the multiple symbols for :category is our challenge. We have to process those symbols (and corresponding values) as some type of list.

So as a long time programmer. Someone who has programmed in procedural languages like "C". Someone who has programmed in Object Oriented Languages like "java". My first thought was:

  • Create an iterator.
  • Iterate (loop) through the list.
  • Inspect the data segment in each iteration.
  • If the data segment meets our criteria, do something, like make a copy of the value.

To summarize, I am thinking of a "for" loop. Thus, my first implementation looked like this:

 (defn extract-categories [data-vec]
  (vec (let [rl (range 0 (count data-vec))
              f (for [n rl]
                  (cond (= (keyword (nth data-vec n)) :category)
                      (first (nth data-vec (+ n 1))))) ]
      (filter (fn [a] (not (nil? a))) f))))

It works. It loops through the data structure and identifies ":category". When ":category" is identified, it grabs the next data segment, because that's where the skill value is located. The "filter" at the end of the function, removes any nil values.

So the first version is pretty detailed. It works, but we can implement the function differently.

In Clojure, you don't need to explicitly write iterators and loops. If you want to do something with each data segment in a list, you can use "map". If you want to select a portion of the list based on some criteria, you can use "filter". In addition, you can use anonymous functions, if your "map" or "filter" contains a very specialized routine. In other words, use a anonymous function if the routine is not going to apply elsewhere in your code.

So the re-factored "extract-categories[]"method becomes:

 (defn extract-categories [data-vec]
  (let [ key-set (filter #(if (not (vector? %1) ) %1) data-vec)
         val-set (map first (filter vector? data-vec))
         maps (map #(assoc {} %1 %2) key-set val-set)
         categories (filter #(if (= (ffirst %1) :category) %1) maps)
         category-vals (vec (map :category categories ))]
    category-vals))

The new logic is simple:

  1. Filter the list for anything that is not a vector. That gives me all the keys.
  2. Filter the list for everything this is a vector. The first element in each vector is our values.
  3. Combine keys and values to form a pseudo map (key/value pairs).
  4. Filter the key/value pairs. Select pairs that have :category as a key.
  5. Remove the keys and store the result values in a Clojure vector.

The second implementation is much easier to read. Much easier to describe. When your function is easy to describe, you are more likely to document your source code.

The second implementation also established a pattern. I ended up re-factoring my overall XML document process to use the same pattern. I take the list of trees created by "xml-seq[]", extract the keys and values. I then combine the keys and values into pairs. I then extract relevant data by it's key (Clojure will process the key/value pairs as if they were a Clojure map).

You can peruse the source listing rss.clj for full XML document processing.

Reporting the Data
To create graphic reports like a "horizontal bar chart", I am using "incanter's" chart library.. Incanter charts uses the Java JFreeChart library.

The following code creates a simple bar chart. For a specific snapshot I want to see to 10 top skills. Top skills meaning, the skills that appeared the most. For each skill I calculate the percent of jobs the skill was present in. The percentage is displayed in a graph.

;; conn parameter is a database connection
;; snapshot-description parameter is our free-form text identifier for a snapshot
;; top parameter is the number of skills we want to report
;; total parameter is the total number of jobs found in the snapshot
(defn display-top-stats-chart [conn snapshot-description top total]
 (let [ title (format "%d Job Postings" total)
        rows (sort-by first (take top (reverse (sort-by last (get-job-skills-freq conn snapshot-description)))))
        skills (map first rows)
        percents (map #(float (* (/ (int (nth % 1)) total ) 100 ) ) rows)
       ]
    (view (bar-chart (vec skills) (vec percents) :title title :vertical false :x-label "Skill" :y-label "% of Job Postings" ) )
 ))

The second statement does the following:

  • Use the number of times a skill was found and sorts the list by that value.
  • Reverses the list so that the values are in descending order (highest values first).
  • Makes a copy of the "top" number of sorted values. E.G. top 10.
  • Resorts the results by skill name.

The last statement actually renders the graph. It's a Java Swing frame.

Using Clojure Test Fixtures

The last item I wanted to discuss was Clojure Test fixtures. A test fixture is a resource a set of tests need, in order to run. For example, in this project we need a RSS XML document to test our parsing routines. For the database functions, we needed a test database.

Setting up the Clojure Test fixture was not obvious to me. Thus, I though it was worth noting.

I created a separate disk/directory called "test/config". Under "test/config" I created a file for each test suite fixture. For example, the fixture setup for the rss tests is located at test/config/rss.clj. That listing contains one function. The function creates some singleton data structures. The method signature looks like this:

(defn load-parse-sample2 [ function-list ]

Note! The function parameter is used by Clojure Test. Clojure Test passed a list of our unit tests to load-parse-sample2[]. Thus in the body of our test setup function (load-parse-sample2[]), we must make a call to each function.

Calling the list of unit test functions is actually must simpler than it sounds. Here is the statement used at the end of of our setup function (load-parse-sample2[]).

(function-list)

We now need to add a statement to our unit test listing. For RSS XML processing our unit tests are located in "test/skill_data_mine/rss_test.clj".

We add the critical statement at the top of our listing:

(use-fixtures :once load-parse-sample2)

The result is, when we run our tests from the command line, Clojure Test first runs the setup function "load-parse-sample2[]. The setup function then runs each of the unit tests located in rss_test.clj.

Using Leinigen, we can run the RSS XML test suite with:

$> lein test skill-data-mine.rss-test
About the Author:
Lorin M Klugman - I'm an experienced developer. My main interest is in new technology. Please use our contact box here if you are interested in hiring me. Please no recruiters :)