Developing Apache Spark Applications in .NET using Mobius

Datetime:2016-08-23 04:18:31          Topic: .Net  Spark           Share

To address the gap between Spark and .NET, Microsoft created Mobius, an open source project, with guidance from Databricks. By adding the C# language API to Spark, it extends and enables .NET framework developers to build Apache Spark Applications. This guest blog provides an overview of this #C API.

Apache Spark has transformed the big data processing and analytics space over the last few years. It provides high-level APIs in Scala, Java, Python and R and dramatically reduced the cost and complexity of building a wide variety of big data workloads. The results ofSpark Survey 2015 indicate that the ease of programming is one of the most important aspects of Spark. So it is apparent that having APIs in multiple languages appealed to various developer persona and contributed to the rapid adoption of Spark.

However, Spark had been out of reach for the .NET developer community. The results of Spark Survey 2015 also indicated that there was huge spike in the Spark usage in Windows, and there is a high likelihood that a good portion of the developers using Spark in Windows are .NET professionals. To address the gap between Spark and .NET, Microsoft created Mobius as an open source project with the goal of adding a C# language API to Spark enabling the usage of any .NET Framework language in building Apache Spark applications. With Mobius, organizations deeply invested in .NET can reuse their existing .NET libraries in their Spark applications.

Spark Applications in .NET

The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.

As shown above, the driver programs are written entirely in a .NET programming language like C# or F# using the C# API in Mobius. Mobius applications can be used with Spark deployed on premises or in the cloud. Mobius is supported on Windows and Linux. In Linux, Mobius uses Mono , an open source implementation of the .NET framework.

Developing & Submitting Mobius Applications

Mobius driver applications can be developed in an IDE (like Visual Studio) that supports .NET development. Mobius API and the worker implementation (used to execute user defined functionality in C# code in Spark worker nodes) are released to NuGet . Once these Mobius binaries and any other .NET library dependencies are added to the Mobius driver project in the IDE, the driver application code can be developed, debugged and tested like any other .NET program within the IDE.

Mobius driver applications in .NET are compiled into an executable (.exe file), which is copied along with its dependencies to the client machine from which Spark job needs to be submitted. A supported version of Mobius release is also needed on the client machine on which Mobius job submission script ( sparkclr-submit.cmd or ) is used to submit Mobius-based application to a Spark cluster. A Mobius job submission script accepts the same parameters as a spark-submit script , but it also needs an additional parameter for specifying the Mobius driver executable name and its path. As shown above, the driver programs are written entirely in a .NET programming language like C# or F# using the C# API in Mobius.

More information on running a Mobius application is available at on GitHub .

The Mobius API has the same method names and signatures with similar data types as the Scala API for Spark. As a result, the driver programs implemented using Mobius look similar to those implemented in Scala or Java. Here is a code example for implementing Spark’s “Word Count” example in C# using Mobius API.

var textFile = sparkContext.TextFile(@"hdfs://...");
var counts = textFile
             .FlatMap(x => x.Split(' '))
             .Map(word => new KeyValuePair<string, int>(word, 1))
             .ReduceByKey((x, y) => x + y)
             .Map(wordCount => $"{wordCount.Key},{wordCount.Value}");

The code snippet below is in F# and shows how to query the data in JSON format and use the DataFrame API to look for rows with State = ‘California’ and also register those rows as a temp table and use Spark SQL to query for all rows with name = ‘Bill’ .

let peopleDataFrame = sqlContext.Read().Json("hdfs://...")
let filteredDf = peopleDataFrame.Select("name", "address.state")
                 .Where("state = 'California'")
filteredDf.RegisterTempTable "filteredDfAsTempTable"
let countAsDf = sqlContext.Sql "SELECT * FROM filteredDfAsTempTable where name='Bill'"
let countOfRows = countAsDf.Count()
printf "Count of rows with name='Bill' and State='California' = %d" countOfRows

More examples for RDD, DataFrame and DStream API are available here . These examples also cover HDFS, Cassandra, Event Hubs, Kafka, Hive and JDBC sources in Mobius applications.

More Resources

You can peruse our GitHub repository , and we welcome your contributions. Additional information on Mobius is available in the slides and video from the talk on Mobius presented at  Spark Summit 2016. Finally, Mobius powers several .NET-based Spark workloads in Microsoft. For example, the Spark Summit 2016 talk ( slides , video ) covers the lessons learned using Spark in a Bing-scale workload.

About the Author

Kaarthik Sivashanmugam is a Principal Software Engineer in the Shared Data platform team at Microsoft. Kaarthik is the tech lead for the Mobius project . Prior to joining the Shared Data team, he was in the Bing Ads team, where he built a near real-time analytics platform using Kafka, Storm and Elasticsearch, and used it to implement data processing pipelines. Previously, at Microsoft, he was involved in the development of Data Quality Service in Azure and also contributed to multiple releases of SQL Server Integration Services.

Twitter: @kaarthikss

About List