Revised Apache Spark MOOC

Written by Alex Denham
Wednesday, 17 August 2016

The online course on Big Data Analysis from BerkeleyX on the edX platform started a re-run this week with a new focus. It now teaches students to program using Spark's Machine Learning pipelines and DataFrames.

CS110x: Big Data Analysis with Apache Spark is a four-week, intermediate-level course that opened on August 15, 2016 and runs until September 12. It is the successor to CS100.1x: Introduction to Big Data with Apache Spark and has the same overall goal of enabling students to learn how to use Apache Spark to perform data analysis. However, whereas the previous incarnation focused only on Spark programming using the lower-level abstraction and programming paradigm of Resilient Distributed Datasets, the new version shows how to use Apache Spark's Machine Learning libraries to analyze Big Data using DataFrames, Spark SQL, and Resilient Distributed Datasets. This will make it of interest to students who have taken CS100.1x but are unfamiliar with Spark Machine Learning pipelines, as well as to the new cohort of students coming to the course for the first time.

The course is taught by Anthony D Joseph, who is both Professor in Electrical Engineering and Computer Science and Technical Adviser at Databricks. The previous version of the course received positive ratings (average 4.2 out of 5 stars) and the consensus was that the weekly labs were the core of the course. The course assignments for this version include Prediction using Machine Learning algorithms, Collaborative Filtering, and Textual Entity Recognition exercises that teach students how to manipulate datasets using parallel processing with PySpark, Spark SQL, and Spark Machine Learning Pipelines. The lab exercises account for 84% of the grade, with the other 16% coming from multiple-choice quizzes. All assignments are due by September 12th, 2016.

The syllabus of the course is as follows:

Week 1: Big Data and Data Science

  • Introduction to Big Data and Data Science - see examples of how data science can leverage big data, and learn about the risks of performing data science without statistics
  • Performing Data Science and Preparing Data - explore data science definitions and topics and the process of acquiring and preparing data, and understand the statistics of Exploratory Data Analysis
  • Machine Learning - learn about Spark's machine learning libraries, ML and MLlib
  • Lab 1: Power Plant Machine Learning Pipeline - perform data exploration and visualization, learn about Spark's Machine Learning Pipeline, and apply and evaluate several Machine Learning algorithms to answer a business question

Week 2: Performing Data Science

  • Data Science Roles
  • Data Quality
  • Data Cleaning
  • Statistical Inference - learn about estimation, bias, variability, data distributions and the Central Limit Theorem
  • Lab 2: Collaborative Filtering on a Movie Dataset 
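
The Central Limit Theorem listed under Statistical Inference can be illustrated with a short simulation. The following is a minimal sketch in plain Python rather than Spark, and is not taken from the course materials; the exponential population, the sample size of 50, and the helper name `sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated samples (with replacement) and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical, deliberately skewed population: exponential with rate 1.0.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=50, num_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# The theorem predicts that the sample means cluster around the population
# mean with a spread close to pop_sd / sqrt(sample_size).
print("population mean:     ", pop_mean)
print("mean of sample means:", statistics.fmean(means))
print("predicted std error: ", pop_sd / 50 ** 0.5)
print("observed std error:  ", statistics.stdev(means))
```

Even though the population is far from normal, the two pairs of printed values should be close to each other, which is the behaviour the theorem describes.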

Week 3: Apache Spark's Resilient Distributed Datasets

  • Spark Low-Level Primitives - learn about Spark's Resilient Distributed Datasets, transformations, and actions, and Spark's shared variables
  • File Performance - understand the considerations for the performance of file read and write actions
  • Lab 3: Text Analysis and Entity Resolution - perform text analysis and entity resolution on Google and Amazon product listings using Spark
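
To give a flavour of what entity resolution involves, here is a minimal sketch in plain Python rather than Spark. It assumes cosine similarity over token counts as the matching technique, which may differ from what the lab actually uses. The product listings, the record ids, the 0.5 threshold, and the helper names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def best_matches(left, right, threshold=0.5):
    """Map each left id to its most similar right id, if above the threshold."""
    matches = {}
    for left_id, left_text in left.items():
        scored = [
            (cosine_similarity(left_text, right_text), right_id)
            for right_id, right_text in right.items()
        ]
        score, right_id = max(scored, default=(0.0, None))
        if score >= threshold:
            matches[left_id] = right_id
    return matches


# Hypothetical listings from two catalogues describing the same products.
catalogue_a = {
    "g1": "Acme Photo Editor 5.0 for Windows",
    "g2": "Zeta Accounting Suite 2008",
}
catalogue_b = {
    "a1": "Acme Photo Editor 5.0 (Windows)",
    "a2": "Zeta Accounting Suite 2008 Small Business",
    "a3": "Garden Planner Deluxe",
}

matches = best_matches(catalogue_a, catalogue_b)
print(matches)
```

Comparing every pair of records, as this sketch does, only works for small inputs; scaling the comparison up is the kind of problem parallel processing is meant to address.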

Week 4: Statistics

  • Statistics - learn about relations, associations, trends, patterns, correlation, and regression
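
Correlation and regression, the last two topics in the list, can be worked through by hand on a small dataset. The sketch below is plain Python, not course code; the data points and the function names `pearson_r` and `fit_line` are made up for illustration.

```python
import math


def _centered_sums(xs, ys):
    """Return (sxy, sxx, syy, mean_x, mean_y) for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y


def pearson_r(xs, ys):
    """Pearson correlation coefficient: strength of the linear association."""
    sxy, sxx, syy, _, _ = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    sxy, sxx, _, mean_x, mean_y = _centered_sums(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements that rise roughly two units in y per unit of x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
print("correlation:", r)
print("fitted line: y =", slope, "* x +", intercept)
```

The correlation coefficient says how closely the points follow a straight line, while the regression gives the line itself, so the two are usually reported together.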

Although CS110x can be taken on its own, it is the second part of the three-course X series. The introductory 2-week course, CS is currently underway, but there is still time to join this presentation, which has the advantage of familiarizing students with the PySpark environment and covering the basics.

The discussion forum for these classes is on Piazza, and it seems a friendly and supportive environment with plenty of positivity about the course so far, especially the labs.

More Information

CS110x: Big Data Analysis with Apache Spark

Introduction to Apache Spark

Related Articles

MOOC On Apache Spark

What is a Data Scientist and How Do I Become One?

Data Science Curriculum on edX

Coursera Data Science Specialization

Microsoft Launches Professional Degree Program With Data Science Pilot

Coursera Offers MOOC-Based Master's in Data Science

Last Updated ( Wednesday, 17 August 2016 )
Copyright © 2016 i-programmer.info. All Rights Reserved.