A URL Shortening Service

2018-February-10

I recently came across Spark, a micro framework for developing web applications on the JVM. I decided to try it out from Clojure by writing a URL shortening service. This post walks through the implementation of the service. In the process, we will create a Clojure wrapper for Spark's programming interface. We will also explore some ideas around configuring and scaling the service.

An Algorithm for Shortening URLs

The technique for shortening a long URL is quite simple: take the 32-bit hash code of the URL string and write its magnitude in base 62, using the digits 0-9, a-z and A-Z. This is accomplished by the following code:


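(require 'clojure.string)

;; Digits of the base-62 alphabet, indexed by value (0 -> \0, 61 -> \Z).
(def ^String base62-lookup "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

(defn decimal->b62
  "Converts an integer to a base-62 string. The sign is discarded."
  [n]
  (if (zero? n)
    "0" ; without this case the loop would return an empty string
    (loop [n n, rs []]
      (if (zero? n)
        (clojure.string/join (reverse rs))
        ;; Math/abs keeps the remainder of a negative hash a valid index.
        (recur (quot n 62)
               (conj rs (.charAt base62-lookup (Math/abs (rem n 62)))))))))

(defn encode
  "Returns the base-62 form of the URL's 32-bit hash code."
  [^String url]
  (decimal->b62 (.hashCode url)))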

Calling encode on a URL will return the shortened version of it:


user> (encode "http://sparkjava.com/documentation")
;; "AuV1V"
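One consequence of this scheme is worth seeing in code. A Java string's hash code is only 32 bits wide, so two different URLs can map to the same short value, and taking `Math/abs` of each remainder also discards the sign. The sketch below is self-contained, so it restates the encoder in compact form, and it uses the well-known colliding strings "Aa" and "BB" (both hash to 2112) as stand-ins for URLs. It only illustrates the limitation; how a real service should resolve collisions is left open here.

```clojure
(require '[clojure.string :as str])

(def ^String base62-lookup
  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

(defn decimal->b62
  [n]
  (loop [n n, rs []]
    (if (zero? n)
      (str/join (reverse rs))
      (recur (quot n 62)
             (conj rs (.charAt base62-lookup (Math/abs (rem n 62))))))))

(defn encode
  [^String url]
  (decimal->b62 (.hashCode url)))

;; 2112 = 34 * 62 + 4; index 34 of the lookup string is \y, index 4 is \4.
(def h-aa (encode "Aa"))
(def h-bb (encode "BB"))

(println h-aa h-bb (= h-aa h-bb))
;; prints: y4 y4 true

;; The sign is lost as well: 2112 and -2112 encode identically.
(println (= (decimal->b62 2112) (decimal->b62 -2112)))
;; prints: true
```

With the in-memory table used later in this post, the second of two colliding URLs would silently overwrite the first.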

Implementing the Web Service

The service exposes two HTTP API endpoints. One is for shortening a URL. This endpoint will internally invoke the encode function that was defined in the preceding section. The second endpoint will receive a value generated by encode and return a redirect to the original URL. Here is the specification for the API:


POST /
Request:
  Content-Type: application/json
  Body: {"url": "a_url"}
Response on Success:
  Status: 200 OK
  Content-Type: application/json
  Body: {"hash": "hash_of_url"}

GET /:hash
Response on Success:
  Status: 302 Found
  Location: "original_url_from_which_hash_was_generated"
Response if :hash was not generated here:
  Status: 404 Not Found

A Spark application uses a set of routes to declare its public interface. The HTTP API that we described above can be implemented with the post and get routes:


;; Spark's classes live in the `spark` package.
(import '[spark Spark Route Request Response])

(defn -main
  []
  (Spark/post "/" (make-handler post-handler))
  (Spark/get "/:hash" (make-handler get-handler)))

A route has three components: the HTTP verb (get, post, put, etc.), a path ("/", "/:hash") and a handler. The handler must be an implementation of a Java interface named Route. This interface has a single method called handle, which is invoked to process a client request and generate a response. The convenience function make-handler generates an implementation of Route. It takes the actual handler function as its argument and arranges for the Route's handle method to call it.


(defn make-handler
  [f]
  (reify Route
    (handle [_ ^Request request ^Response response]
       (f request response))))

The Request object contains information about the HTTP request, such as its headers, content type and body. The Response object exposes methods that the handler can call to generate a valid HTTP response.

Now we can proceed to implement the handler functions themselves. First, we will define the handler for the POST request. This handler reads the JSON-encoded request body, parses it to extract the URL and generates the base-62 hash of the URL. This hash is then sent back in the response. The function also maps the hash to the original URL in an in-memory lookup table. The initial implementation of the POST handler is shown below:


(def db (atom {}))

(defn post-handler
  [request response]
  (let [r (cheshire.core/parse-string (.body request) true)
        url (:url r)
        short-url (encode url)]
    (swap! db assoc short-url url)
    (.status response 200)
    (.header response "Content-Type" "application/json")
    (cheshire.core/generate-string {:hash short-url})))

The GET handler receives a hash (or short-url) as input. This hash will be used as the key to query the lookup table. If a URL is found mapped to this key, a redirect is generated for this URL. If no mapping is found, an HTTP 404 (Not Found) response is returned.


(defn get-handler
  [request response]
  (if-let [url (get @db (.params request ":hash"))]
    (.redirect response url)
    (.status response 404)))

The first version of the URL shortening service is ready! You can download the complete project here. Execute lein run from the extracted project folder. The service should come up and start listening for incoming HTTP requests on port 4567. Here are a few curl sessions to test the service:


$ curl -v -X POST -d '{"url": "http://sparkjava.com/documentation#getting-started"}'\
  -H 'Content-Type: application/json' 'http://localhost:4567'

HTTP/1.1 200 OK
{"hash":"1BqMVO"}

$ curl -v 'http://localhost:4567/1BqMVO'

HTTP/1.1 302 Found
Location: http://sparkjava.com/documentation#getting-started

Adding Storage

One problem with the current implementation is that the hash->url mappings are stored in the memory of the service itself. If the JVM is shut down, all data is lost and the users of the service will not be very happy :-). Moreover, it becomes impossible to scale the service by running multiple instances behind a load balancer, because each instance would only know the mappings it created itself. So it is necessary to add a storage layer that can be shared by multiple instances of the service. This could be an RDBMS server like MySQL or a key-value store like Couchbase. I will use Couchbase for this example.

It is straightforward to make the service talk to a data store like Couchbase, which just maps a string key to a string value. This model is similar to the one used by the current in-memory store. We can add a store abstraction to the service which internally uses the Couchbase client library for Clojure to talk to a Couchbase cluster:


(ns url-shortner.store
  (:require [couchbase-clj.client :as cb]))

(defn open-connection
  [props]
  (cb/create-client props))

(defn close-connection
  [conn]
  (when conn
    (cb/shutdown conn)))

(defn set-data
  [conn k v]
  (cb/set conn k v))

(defn get-data
  [conn k]
  (cb/get conn k))

The props argument passed to open-connection is a Clojure map that specifies the configuration (username, server URLs, etc.) required to connect to the Couchbase cluster. This configuration may change from one deployment site to another. That means the service must be able to load site-specific configuration at startup. An easy way to manage configuration is to encode it in EDN format. This allows the application to reuse Clojure's built-in reader to load and decode the configuration, as shown below:


(def config (read-string (slurp "./config.edn")))

(def db (store/open-connection (:store config)))

The contents of the configuration file are:


;; config.edn

{:web-server-port 8000
 :store
 {:username "Administrator"
  :bucket "default"
  :uris ["http://localhost:8091/pools"]}}

Note that we have made the port on which the service listens for incoming requests configurable as well.
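One caveat about loading this file: `clojure.core/read-string` can evaluate code embedded in its input while reader evaluation is enabled, whereas `clojure.edn/read-string` only ever reads data. The following is a minimal sketch of a loader built on the EDN reader. The names `default-config`, `parse-config` and `load-config` are hypothetical and not part of the service's code, and the fallback port 4567 is simply Spark's default mentioned earlier.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical defaults, used when a key is missing from the file.
(def default-config
  {:web-server-port 4567
   :store {:bucket "default"}})

(defn parse-config
  "Parses EDN text and layers it over default-config.
  Nested maps (such as :store) are merged one level deep."
  [text]
  (merge-with (fn [a b] (if (and (map? a) (map? b)) (merge a b) b))
              default-config
              (edn/read-string text)))

(defn load-config
  "Reads and parses the configuration file at path."
  [path]
  (parse-config (slurp path)))

;; A site file that only sets the username still gets a port and a bucket.
(def example (parse-config "{:store {:username \"Administrator\"}}"))

(println (:web-server-port example))
;; prints: 4567
```

Keeping the parsing step separate from the file access makes the merge behaviour easy to check from a REPL without touching the disk.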

Now we should update post-handler to store the mapping in the remote store:


(store/set-data db short-url url)

get-handler can look up its response as:


(store/get-data db (.params request ":hash"))

The -main function has to be updated to start the server on the configured port. Spark requires the port to be set before any routes are declared:


(Spark/port (:web-server-port config))

Simplifying the Programming Interface

Spark is not a framework designed with Clojure in mind. So it's a good idea to write a simpler and more idiomatic Clojure interface on top of Spark. This interface should expose the Request and Response objects as native Clojure data structures. The route specification should directly accept functions instead of implementations of the Route interface. The handlers should also be more functional in their behavior ‐ accept a request map and return a response map.

I wrote a Clojure wrapper for Spark that does all this. It provides enough abstractions for the web layer to be re-written in better Clojure style.
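To give a feel for the kind of translation such a wrapper performs, here is a minimal sketch of two pure helper functions. This is not the wrapper's actual code: the names `params->map` and `header-pairs` are hypothetical, and the sketch assumes that Spark reports route parameters as a map keyed by strings with a leading colon (for example ":hash"). The Spark-facing glue, which is omitted, would pass the result of `(.params request)` to the first function and call `(.header response name value)` for each pair produced by the second.

```clojure
(defn params->map
  "Turns Spark-style route params such as {\":hash\" \"AuV1V\"} into
  {:hash \"AuV1V\"} by dropping the leading colon from each key."
  [params]
  (into {}
        (for [[k v] params]
          [(keyword (subs k 1)) v])))

(defn header-pairs
  "Flattens the :headers of a response map into [name value] string
  pairs, accepting either keyword or string keys."
  [response-map]
  (for [[k v] (:headers response-map)]
    [(name k) (str v)]))

(println (params->map {":hash" "AuV1V"}))
;; prints: {:hash AuV1V}

(println (header-pairs {:status 302
                        :headers {:Location "http://sparkjava.com/documentation"}}))
```

Keeping these conversions free of Spark types means they can be exercised with plain maps, while the only code that touches Java objects is a thin adapter around them.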

The code for the new web layer that makes use of this wrapper is reproduced below:


(ns url-shortner.core
  (:require [cheshire.core :as json]
            [url-shortner.encoder :as e]
            [url-shortner.spark :refer :all]
            [url-shortner.store :as store]))

(def config (read-string (slurp "./config.edn")))

(def db (store/open-connection (:store config)))

(defn post-handler
  [request]
  (let [r (json/parse-string (:body request) true)
        url (:url r)
        short-url (e/encode url)]
    (store/set-data db short-url url)
    {:status 200
     :headers {:Content-Type "application/json"}
     :body (json/generate-string {:hash short-url})}))

(defn get-handler
  [request]
  (if-let [url (store/get-data db (:hash (:params request)))]
    {:status 302
     :headers {:Location url}}
    {:status 404
     :body "Not Found"}))

(defn -main
  []
  (port! (:web-server-port config))

  (GET "/:hash" get-handler)
  (POST "/" post-handler))

The application code no longer has to deal directly with low-level Java abstractions. Instead, all request handling is implemented using first-class Clojure data structures and functions.

The complete source code for the updated service can be downloaded here. Now you can start multiple instances of the service, by configuring a unique port number for each. Also make sure the :store configuration can connect the service to a running Couchbase cluster. You can distribute POST and GET requests across these instances and see them serving the requests from data in the shared data store.

Load-balancing and Scaling

Let us finish this post by automating the task of load-balancing between the several instances of the service. Nginx is a popular HTTP server and load balancer. To test load-balancing on my development box, I added the following to the local nginx configuration:


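http {
    # Group of service instances; the group name is what proxy_pass refers to.
    upstream localhost {
        server localhost:8000;
        server localhost:8002;
    }

    server {
        listen 8080;

        # Forward every request to one of the instances in the group.
        location / {
            proxy_pass http://localhost;
        }
    }
}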

With this configuration, nginx accepts connections on port 8080 and forwards each one to one of the service instances running on ports 8000 and 8002 of the same machine. The load-balancing method is round-robin, which is the default.

Start two instances of the URL shortener, one on port 8000 and the other on port 8002, and restart nginx. Calls to POST http://localhost:8080 and GET http://localhost:8080/:hash will be distributed between the two instances by nginx. To scale the service, start new instances and add them to the upstream configuration.
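The effect of round-robin selection can be pictured with a few lines of Clojure. This is only a simplified model with equal weights and hypothetical names (`upstreams`, `round-robin`); nginx itself also accounts for server weights and for instances that are temporarily unavailable.

```clojure
;; The two instances from the upstream block above.
(def upstreams ["localhost:8000" "localhost:8002"])

(defn round-robin
  "Returns the server chosen for the n-th request (0-based) when the
  servers are used in turn with equal weight."
  [servers n]
  (nth servers (mod n (count servers))))

(println (map #(round-robin upstreams %) (range 4)))
;; prints: (localhost:8000 localhost:8002 localhost:8000 localhost:8002)
```

Because consecutive requests land on different instances, a GET for a hash will often reach an instance other than the one that handled the POST, which is why the shared data store is required.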

Conclusion

Spark provides a simple and clean interface for writing HTTP based services that can be easily integrated with any language running on the JVM. Adding a functional wrapper on top of its basic interface definitely makes it more appealing for server-side development in Clojure.