
Data Mining with Clojure and Datomic

In this article we take a look at a few items:

  • Applying a "Functional Style" to our code. Making better use of Clojure's programming features.
  • Introduction to Datomic, a database written in Clojure.
  • Use of Clojure Test fixtures.

For our project, we are going to analyze job skills. We use Stack Overflow Careers 2.0 as our data source. For example, I query Stack Overflow Careers 2.0 for jobs within a 10 mile radius of my zip code. I search for job postings that contain the keyword "java" or "clojure". I capture the search results as an RSS XML document. I save the XML document to disk.

We use Datomic to persist our data. We also use Datomic to query our data. We then create reports from the queried data.

Datomic is not a relational database. However, it's pretty simple to define a Datomic schema which includes logical relationships.

For our project we first establish a "snapshot". A "snapshot" simply tells us when we collected our data. A "snapshot" also includes a short description. Here is the "snapshot" schema definition supplied to Datomic:

{:db/id #db/id[:db.part/db]
 :db/ident :snapshot/time
 :db/valueType :db.type/instant
 :db/cardinality :db.cardinality/one
 :db/doc "time data was extracted. milliseconds"
 :db.install/_attribute :db.part/db}

{:db/id #db/id[:db.part/db]
 :db/ident :snapshot/description
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/identity
 :db/doc "free form description of snapshot"
 :db.install/_attribute :db.part/db}

{:db/id #db/id[:db.part/db]
 :db/ident :snapshot/job-set
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/doc "List of Jobs obtained during snapshot"
 :db.install/_attribute :db.part/db}

Note the last attribute ":snapshot/job-set". That states a "snapshot" contains zero to many "jobs". The cardinality is "many", the datatype is "ref". We are storing references to "jobs". This definition is similar to a relational database "foreign key".

We use the same concept to relate a "job" to "skills". Each "job" has an attribute which stores zero to many "skill" references. A skill being something like "java", "sql", "python", etc.
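The job-to-skill attribute itself is not listed in this article. Following the ":snapshot/job-set" pattern, it would presumably look something like the sketch below. The ident ":jobs/skill-set" is the one used by the add-job code later in this article; the doc string is a guess, and schema.dtm remains the authoritative definition.

```clojure
;; Sketch only: modeled on :snapshot/job-set above.
;; The doc string is an assumption, not copied from schema.dtm.
{:db/id #db/id[:db.part/db]
 :db/ident :jobs/skill-set
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/doc "List of skills mentioned in the job posting"
 :db.install/_attribute :db.part/db}
```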

The complete schema is located on GitHub in schema.dtm.

The process for updating the database is simple. I first process the "skills". I query Datomic to determine whether a skill already exists in our database. If the skill does not exist, we want Datomic to generate a unique identifier, also known as an entity id. If the skill already exists, we use the existing entity id.

Here is the relevant code.

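;; assumes the Datomic API is loaded with
;; (use '[datomic.api :only [q db] :as d])
;; conn parameter is the database connection
;; queries the database for a particular skill (E.G. "programming")
;; if found, returns the database entity id, otherwise nil
(defn get-skill-entity-id [conn skill]
  (let [results (q '[:find ?c :in $ ?t :where [?c :skill-set/skill ?t]]
                   (db conn) skill)]
    ;; each result is a tuple such as [entity-id],
    ;; so take the first element of the first tuple
    (ffirst results)))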

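;; returns true if the skill cannot be found
;; in the database (E.G. do we have "programming" stored in the database?)
(defn skill-not-exists? [conn skill]
  (nil? (get-skill-entity-id conn skill)))

;; add-skill is defined further down, so declare it before its first use
(declare add-skill)

;; adds the skill only when it is not already stored
(defn process-skill [conn skill]
  (when (skill-not-exists? conn skill)
    (add-skill conn skill)))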

Now, let's add the skill.

(defn add-skill [conn skill]
  (let [temp-id (d/tempid :db.part/user)]
    @(d/transact conn
                 [[:db/add temp-id :skill-set/skill skill]])))
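To try the look-up-or-create idea end to end, here is a minimal sketch against an in-memory Datomic database. It assumes the Datomic peer library is on the classpath. The database name, the helper names find-skill-id and ensure-skill!, and the string value type of ":skill-set/skill" are all assumptions made for illustration; they are not taken from the project.

```clojure
;; Assumes the Datomic peer library is on the classpath.
(require '[datomic.api :as d])

;; Hypothetical in-memory database, used only for this sketch.
(def uri "datomic:mem://skills-sketch")
(d/create-database uri)
(def conn (d/connect uri))

;; Install just the one attribute the sketch needs.
;; The string value type is an assumption; see schema.dtm for the real one.
@(d/transact conn [{:db/id (d/tempid :db.part/db)
                    :db/ident :skill-set/skill
                    :db/valueType :db.type/string
                    :db/cardinality :db.cardinality/one
                    :db.install/_attribute :db.part/db}])

;; Returns the entity id of the skill, or nil when it is not stored.
(defn find-skill-id [db skill]
  (ffirst (d/q '[:find ?c :in $ ?t :where [?c :skill-set/skill ?t]]
               db skill)))

;; Adds the skill only when the current database value does not have it.
(defn ensure-skill! [conn skill]
  (when (nil? (find-skill-id (d/db conn) skill))
    @(d/transact conn [[:db/add (d/tempid :db.part/user)
                        :skill-set/skill skill]])))

;; "java" appears twice on purpose; it should be stored only once.
(doseq [skill ["java" "sql" "java"]]
  (ensure-skill! conn skill))

;; Number of distinct skill entities now in the database.
(def skill-count
  (ffirst (d/q '[:find (count ?c) :where [?c :skill-set/skill _]]
               (d/db conn))))
```

Taking the database value with (d/db conn) right before each check matters here, because a database value is an immutable point-in-time view and would not reflect skills added after it was obtained.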

Now that the skills are processed, we can process jobs. Remember, when we process a job we need to associate zero to many skills with it. We now have entity ids for all the skills, so we can use those entity ids as references for our jobs. Here is a portion of our code which stores a job.

;; dispatch on the class of the last argument, the skill list
(defmulti add-job
  (fn [conn entity-id title job-key skill-list] (class skill-list)))

;; job has skill list
(defmethod add-job clojure.lang.PersistentVector
  [conn entity-id title job-key skill-list]
  @(d/transact conn [{:db/id entity-id
                      :jobs/title title,
                      :jobs/job-key job-key,
                      :jobs/skill-set skill-list}]))

;; job does not have any skills 
(defmethod add-job nil
  [conn entity-id title job-key skill-list]
  @(d/transact conn [{:db/id entity-id
                      :jobs/title title,
                      :jobs/job-key job-key}]))

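We define 2 methods: 1) a job with skills, 2) a job without any skills. The dispatch function returns the class of the skill list, so a vector selects the first method and nil selects the second.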
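Dispatching on class can be tried out without a database. The sketch below uses a hypothetical multimethod, describe-skills, that is not part of the project. It also adds a ":default" method, which add-job does not have; without one, calling a multimethod with a list or a lazy sequence throws an IllegalArgumentException because no method matches that class.

```clojure
;; Hypothetical multimethod, used only to illustrate dispatch on class.
(defmulti describe-skills (fn [skill-list] (class skill-list)))

;; Vector literals and the results of vec are PersistentVector instances.
(defmethod describe-skills clojure.lang.PersistentVector
  [skill-list]
  (str (count skill-list) " skill(s)"))

;; (class nil) is nil, so this method handles a missing skill list.
(defmethod describe-skills nil
  [_]
  "no skills")

;; Anything else, such as a list or a lazy sequence, lands here
;; and is converted to a vector before dispatching again.
(defmethod describe-skills :default
  [skill-list]
  (describe-skills (vec skill-list)))

(describe-skills ["java" "sql"])            ; vector method
(describe-skills nil)                       ; nil method
(describe-skills (map identity ["java"]))   ; :default method
```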

We use the same logic that we used for skills to establish an entity id. We look up the job. If the job is already present in the database, we use the existing entity id. If the job does not exist, we let Datomic create a new entity id.

Finally, we can add our snapshot to the database. The same logic applies. The snapshot contains zero to many job references. Now that we have an entity id for each job, we use the job entity ids as references in our snapshot.
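The snapshot code is not shown in this article, so the following is only a sketch of one way to build the snapshot transaction as plain data. The function name snapshot-tx and the numeric entity ids are hypothetical; the attribute names come from the schema shown earlier. Because the function only builds data, it can be tried without a database.

```clojure
;; Hypothetical helper: builds the transaction data for one snapshot.
;; job-ids is a collection of job entity ids and may be empty or nil.
(defn snapshot-tx [entity-id time description job-ids]
  [(cond-> {:db/id entity-id
            :snapshot/time time
            :snapshot/description description}
     ;; only include the reference attribute when there are jobs
     (seq job-ids) (assoc :snapshot/job-set (vec job-ids)))])

;; The ids 100, 101, 201 and 202 are placeholders. With a real connection
;; and the Datomic API aliased as d, the result would be passed on as:
;;   @(d/transact conn (snapshot-tx ...))
(snapshot-tx 100 (java.util.Date. 0) "java jobs" [201 202])
(snapshot-tx 101 (java.util.Date. 0) "clojure jobs" nil)
```

Leaving the ":snapshot/job-set" key out when there are no jobs plays the same role as the second add-job method, which omits the skill set for a job without skills.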

Leveraging Clojure

Project Enhancements
This article is part of a series where we build and enhance the "mobile site generation" project. Each installment in this series looks at a different computer language. A quick recap of our project: the "mobile site generation" project is a command line utility. The utility reads an RSS XML document and then generates a custom website. The custom website is viewable on mobile devices. You can read more details in previous articles. In this installment, we:

  • Add new navigation links.
    • The logo header now contains a link to our table of contents (index.html) page.
  • Add a meta-tag to the HTML source code. The meta-tag is a list of keywords. The keywords are used by search engines like Bing and Google to categorize the website. We dynamically build the keyword list based on the RSS XML "category" tags.
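
The keyword meta-tag can be sketched in a few lines. The function name keywords-meta-tag is hypothetical, and the sketch assumes each article is a map with a ":categories" collection (the key used in the listing later in this article) whose values need no HTML escaping.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: articles is a sequence of maps, each with a
;; :categories collection taken from the RSS "category" tags.
(defn keywords-meta-tag [articles]
  (let [keywords (distinct (mapcat :categories articles))]
    (str "<meta name=\"keywords\" content=\""
         (str/join ", " keywords)
         "\">")))

;; Made-up sample input; "Polyglot" appears twice but is emitted once.
(keywords-meta-tag [{:categories ["Clojure" "Polyglot"]}
                    {:categories ["Polyglot" "Java"]}])
;; => "<meta name=\"keywords\" content=\"Clojure, Polyglot, Java\">"
```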

In this article we implement and enhance our project application with Clojure.

Clojure is a dialect of Lisp. Like Scala, Clojure's default implementation runs on the Java Virtual Machine (JVM). Like Scala, there is also an implementation that runs on Microsoft's .NET CLR.

Concise Coding
Clojure fits the dictionary description of the word concise. Clojure expressions are very brief, yet very comprehensive.

To compensate for Clojure's brevity, I added a fair number of comments to my source code. Writing Clojure code reminded me of writing "C" expressions. In "C" you can write a function that returns a pointer to a function which returns an array of pointers to structures. When you are writing such expressions, the ideas are fresh in your mind. But, if you haven't coded in "C" for a period of time, then you need an explicit explanation of those same, concise "C" expressions. Thus, I did the same for my Clojure code. I added explicit explanations to the Clojure source code. Along the same lines, I formatted the Clojure code in an outline mode. That is probably not the typical format for a Clojure project. However, for folks following this series, I felt it would be easier to compare corresponding functionality (expressed in Java, Scala and Ruby).

Functional Programming
Both Scala and Clojure are referred to as "Functional" computer languages. "Functional" as opposed to "Object Oriented" (E.G. Java, Ruby), or "Procedural" (E.G. "C", Basic). In an Object Oriented language like Java, you can compose an object tree of parent and child objects (E.G. inner classes). In a "Functional" language you express function trees (E.G. higher level functions that contain either named or anonymous child-like functions).

For example, the following code listing includes 2 anonymous functions. Those functions are defined inside the definition of a larger function (not shown here, see function "main-process"). The whole expression first iterates through a list of articles (as represented by "nodes"). Each article contains one to many categories. The expression then iterates through the list of categories. The inner expression returns true if any of the categories match the criteria, "is equal to the word Polyglot". If the inner expression returns true, the article is appended to the collection represented by the variable "poly-articles".

To summarize, the expression compiles a list of articles that have been categorized as "Polyglot".

(let [poly-articles
      ;; keep each article in nodes that has a "Polyglot" category
      (filter (fn [n]
                (some #(= "Polyglot" %) (:categories n)))
              nodes)]
  poly-articles)
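
To see the two anonymous functions at work, here is a self-contained sketch with made-up sample articles. The ":title" key, the sample data, and the names polyglot-short? and polyglot-long? are hypothetical. Applying the outer function with filter is an assumption based on the description above, since the surrounding "main-process" function is not shown.

```clojure
;; Made-up sample data standing in for the parsed RSS articles.
(def nodes
  [{:title "Leveraging Clojure" :categories ["Clojure" "Polyglot"]}
   {:title "Java tips"          :categories ["Java"]}
   {:title "Uncategorized"      :categories []}])

;; The inner anonymous function, written with the #() reader shorthand...
(def polyglot-short? #(= "Polyglot" %))

;; ...and the same function written out with fn.
(def polyglot-long? (fn [category] (= "Polyglot" category)))

;; The outer anonymous function asks whether any category matches.
;; some returns true at the first match and nil when nothing matches.
(def poly-articles
  (filter (fn [n] (some polyglot-short? (:categories n)))
          nodes))

(map :title poly-articles)
;; => ("Leveraging Clojure")
```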

Data Symbols
Note! In most computer languages, a symbol representing data is referred to as a "variable" (E.G. Java Integer myNum = 1;). I'll use the term "variable" loosely, just to make the code explanation a little more familiar.

However, there is an important distinction. In Clojure, and in "Functional Programming", the data sent in to a function (I.E. a parameter) is not mutable. The data parameter does not change. The data parameter does not "vary". Thus the term "variable" doesn't quite fit.
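
A small sketch makes the point concrete. The names skills, with-skill and more-skills are hypothetical. The function appears to add an element to its parameter, yet the vector that was passed in is unchanged afterwards, because conj returns a new vector.

```clojure
(def skills ["java" "sql"])

;; Hypothetical function: "adds" a skill by returning a new vector.
;; The skill-list parameter itself is never modified.
(defn with-skill [skill-list skill]
  (conj skill-list skill))

(def more-skills (with-skill skills "clojure"))

skills       ; still ["java" "sql"]
more-skills  ; ["java" "sql" "clojure"]
```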

Again, I'll use the term "variable" here, only, because most programmers understand "variable" means "data symbol".
