Datetime: 2016-08-23 02:14:10          Topic: Node.js Cluster Analysis

We started the EclairJS project to bridge the gap between Node.js applications and Apache Spark™.

Web application developers want to incorporate increasingly sophisticated analytics and more types of data. Node.js, one of the most popular frameworks for quickly developing front-end applications for the enterprise, scales extremely well, but comes with a surprising limitation: because it favors the creation of network connections over the completion of back-end processing tasks, it is not a good platform for large-scale data processing.

Apache Spark is, of course, an engine purpose-built for large-scale data: it is fast, because of its in-memory processing model, and it is highly scalable, since capacity can be grown simply by adding compute nodes to a Spark cluster.

The EclairJS project bridges the gap between Node.js applications and Apache Spark by providing the Spark API in JavaScript, a language that is not otherwise supported in Spark.

To illustrate EclairJS's ability to bridge the gap between Node.js and Spark, we'll describe a Node.js application written in JavaScript that uses one of Spark's machine learning algorithms, k-means, which clusters observations that have similar characteristics.
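As a quick refresher on how k-means works, here is one iteration of the algorithm in plain JavaScript, independent of Spark or EclairJS (the one-dimensional points and initial centers below are invented for illustration). Each iteration alternates between assigning every point to its nearest center and recomputing each center as the mean of its assigned points:

```javascript
// One k-means iteration on 1-D points, for illustration only.
function kmeansStep(points, centers) {
    // 1. Assign each point the index of its nearest center.
    var assignments = points.map(function (p) {
        var best = 0;
        centers.forEach(function (c, i) {
            if (Math.abs(p - c) < Math.abs(p - centers[best])) best = i;
        });
        return best;
    });
    // 2. Recompute each center as the mean of its assigned points.
    return centers.map(function (c, i) {
        var mine = points.filter(function (p, j) { return assignments[j] === i; });
        if (mine.length === 0) return c; // keep an empty cluster's center
        return mine.reduce(function (a, b) { return a + b; }, 0) / mine.length;
    });
}

console.log(kmeansStep([1, 2, 10, 11], [0, 12])); // [ 1.5, 10.5 ]
```

Repeating this step until the centers stop moving yields the cluster centers; Spark's MLlib runs the same procedure, but distributed over the cluster's worker nodes.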


Imagine we are building a real-estate application and we want to segment (cluster) the properties in a regional housing market by price, square footage, number of bedrooms, etc., so we can help sellers determine which segment they should sell into. Using EclairJS we can write some JavaScript that will first create a k-means model describing the various segments, and then predict, for any new property, which segment it belongs to:

 1 var spark = require("eclairjs");
 2 var Vectors = spark.mllib.linalg.Vectors;
 3 var sc = new spark.SparkContext("local[*]", "K Means Example");
 4 var nClusters = 3, nIterations = 20;
 5 var rawTrainData = sc.textFile("trainData.txt");
 6 var trainData = rawTrainData.map(function (line) {
 7     var tokens = line.split(" ");
 8     var point = [];
 9     tokens.forEach(function (t) {
10         point.push(parseFloat(t));
11     });
12     return Vectors.dense(point);
13 });
14
15 var model = spark.mllib.clustering.KMeans.train(trainData, nClusters, nIterations);
16
17 model.clusterCenters().then(function(results) {
18     console.log('Cluster centers: ', results);
19 });
20
21 model.computeCost(trainData).then(function(results) {
22     console.log('Cost: ', results);
23 });

We start by loading the EclairJS module using Node.js's require function, and then create a SparkContext, which is the context for all the Spark operations and variables (lines 1 & 3). In this example we specify that the context should run on our local machine, but this is where we could specify a remote Spark cluster. After creating the context, we read and prepare the data that will be used to train our model, representing it as a distributed dataset of Vectors (lines 5-13). One point to notice here is that the inline function definitions, i.e. the arguments of .map and .forEach, are written in JavaScript. This is significant because under the covers these functions are executed on the Spark cluster's distributed worker nodes. Once the training data has been prepared, we use it to train the model (line 15).
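To make the data-preparation step concrete, here is what that inline map function does to a single line of trainData.txt, in plain JavaScript and detached from Spark. The sample line is invented for illustration, on the assumption that each line holds one space-separated feature vector (say, price, square footage, and bedrooms):

```javascript
// What the inline map function does to one line of training data.
// The sample input is a hypothetical property record.
function parseLine(line) {
    var point = [];
    line.split(" ").forEach(function (t) {
        point.push(parseFloat(t));
    });
    return point; // in the Spark job, this array is wrapped in Vectors.dense
}

console.log(parseLine("250000 1800 3")); // [ 250000, 1800, 3 ]
```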

Computing the model and applying it to make predictions on new data may take some time. However, in an interactive user application we will often want to continue executing statements rather than block waiting for the results of such long-running operations, so we need a mechanism for handling the results when they finally arrive. You can see how EclairJS accomplishes this with the .then functions (lines 17 & 21), which take as arguments callback functions to be executed when the model has been computed. So when our application runs, it will only print the parameters of the model (line 18) and the results of applying the model (line 22) once those results are available; until then, it may execute statements appearing later in the application, i.e. after line 23 in our example.
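The listing above trains the model but never classifies a new property. In Spark's MLlib, a trained KMeansModel exposes a predict method that returns the index of the nearest cluster center; assuming EclairJS mirrors this and returns a Promise like the other model methods shown above, the call would look roughly like model.predict(Vectors.dense(newPoint)).then(...). Conceptually, predict just finds the nearest center, which we can sketch in plain JavaScript (the centers and query point below are invented for illustration):

```javascript
// What KMeansModel.predict does conceptually: return the index of the
// cluster center nearest the point (by squared Euclidean distance).
function nearestCenter(point, centers) {
    var best = 0, bestDist = Infinity;
    centers.forEach(function (c, i) {
        var d = 0;
        c.forEach(function (v, j) {
            d += (point[j] - v) * (point[j] - v);
        });
        if (d < bestDist) { bestDist = d; best = i; }
    });
    return best;
}

// Two hypothetical segments (price, sq ft, bedrooms):
var centers = [[250000, 1500, 3], [900000, 4000, 5]];
console.log(nearestCenter([300000, 1700, 3], centers)); // 0 (first segment)
```

A new listing would then be recommended to sellers in whichever segment its feature vector falls closest to.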

Visit EclairJS on GitHub, or email us to join the community Slack and group list.
