January 13th, 2016
Today we proudlyannounced that Arkena , one of Europe’s leading media services companies, is using Hortonworks Data Platform (HDP™) to provide its media customers with an advanced analytics platform to deliver content to OTT customers through its content delivery network (CDN). This is a guest post from Reda Benzair the Vice President of Technical Development at Arkena. You can also join Arkena and Hortonworks February 16 th for a live and on-demand webinar about their Advanced Analytics Platform clickhere.
Delivering an Advanced Analytics Platform for Media Management with Arkena
We are one of Europe’s leading media services companies with 20 years of experience in the media industry. We have a presence in eight European countries and the U.S and serve more than 1500 customers, such as broadcasters, telecom operators, VOD platforms, content owners and corporations in managing their linear and on-demand workflows. These workflows, relying on a CDN, generate high volumes of log data – up to 200 GB per day –with 60,000 events per second received from the CDN/OTT platform and which needs to be managed, processed and stored in real time for analytic purposes. Arkena provides an OTT solution with modular components that enables content owners, telecom operators and broadcasters to provide video content to viewers worldwide. The know-how, the experience and the expertise which enables their customers such as BeIN Sport (France), Canal+ (France) and more, to distribute their content everywhere, regardless of scale and complexity.
Our data challenge
In the media industry, quality of experience is just as important as quality of content. One in five viewers will abandon poor streaming video experiences immediately. Our customers wanted real-time control not only over the content but also over the data.
Our streaming media platform produces a tremendous amount of data and had pushed our existing systems over their limits. Adaptive bitrate streaming (ABS) is a new technique for streaming multimedia over computer networks. Historically, most video streaming approaches were founded on RTP or RTSP; but now most adaptive streaming technologies are built for transmission over HTTP over large, broad-distribution networks. HTTP-based adaptive bit rate technologies are significantly more operationally complex and generate more data per stream than traditional streaming technologies. For example, if a user streams just one 1 hour HD movie with 8 video tracks for different bitrates and 1 audio track over the MPEG-DASH protocol or HLS protocol, they will generate around 4200 log events on the edge. You can imagine the data volumes that are generated when tens of thousands of users connect to the video stream of a football (soccer) game.
We needed the ability to collect data from across our platforms in order to provide customers with insight into the performance of their content. For example we are creating between 20 Mbps and 40 Mbps of raw data logs resulting in 100 GB to 200 GB of log data per day and we can reach 120 Mbps at peak. Overall across our platforms we have to store and secure 3.5 terabytes of uncompressed data per day. Additionally, we also wanted to be able to cost effectively create an online archive of the data logs for 12 months for advanced long term analysis of content performance.
Additionally, we wanted our platform to provide very granular analysis by drilling down to fifteen different metrics such as volume, session duration and unique users. This could be split out over 15 different dimensions including country, city, user agent, browser and HTTP status code. Having implemented the Dynamic Adaptive Streaming over HTTP protocol, we were experiencing increased volume and speed of data, and as I mentioned our previous system could no longer support our needs.
An open source platform solution
To give our customers the real-time insight they needed our goal was to provide updated statistics every 15 minutes in an easy to access dashboard. Combine all of this together and that a Hadoop-based data platform was the right solution to meet this challenge. We turned to Hortonworks and Hortonworks Data Platform (HDP) with fully integrated Spark Streaming computing framework to enable us to quickly provide the big data architecture required to meet our customers’ evolving demands.
In this solution the data processing is accomplished with a combination of Apache Hive for scalable interactive processing and Apache Spark Streaming for the real-time processing. The strong data security features of HDP provided by Apache Knox and Apache Ranger allows us to keep data secure for each of our customers.
Data is all securely stored in HDFS because of the YARN based architecture of HDP we are able to flexibly run multiple processing and analytic workloads across that data. Single entry point from HDFS, can query and cross-correlate everything from the beginning of time.
We also implemented ElasticSearch for real-time search and analytics capabilities. ElasticSearch is well integrated with their Hadoop connector.
The advanced analytics platform architecture details
I’d like to share some of the details of our platform architecture with you. The production architecture of our analytics platform has four layers. The data is streamed from our edges into HDP using a combination of Rsyslog software with RELP protocol and Apache Flume. Data is all securely stored in the HDP cluster. The computations of an event are made in real time with Spark Streaming and more complex compute calculations are processed with the Hive/Tez frameworks. The results are ingested and indexed in an ElasticSearch cluster. On the front of cluster we’ve added an API layer to query the results.
Figure 1: Arkena Analytics Architecture
One of the most complicated challenges we had when designing the platform was the transport of the logs from the network edge to HDP with high reliability. To solve this challenge we combined the Rsyslog software with RELP protocol and Apache Flume. Figure 2 below is an overview showing the flow of the logs between the Edge Platforms and the main HDP cluster.
Figure 2 : Log Aggregator architecture
Logs are sent from the edge servers to Log Aggregators. There is one aggregator per PoP. In Figure 2 we can see different PoP’s used by Arkena to deliver their OTT/CDN Services. But, as the log aggregators are not specific to any PoP, we could reproduce this setup on any PoP, hereby designated as “PoPx” or “PoPy”, just by deploying generic log aggregators.
The log aggregator is responsible for reliably forwarding the logs to the Compute Cluster. If the compute cluster is unavailable, the logs are spooled on disk, and stay on the aggregators until the compute cluster comes back online.
The logs are ingested in HDFS once the local Rsyslog on each Hadoop node receives an event. Apache Flume is used to fetch the logs from Rsyslog and push them to HDFS. Figure 3 below describes which log transport software is involved in transferring the log to its neighbour at each step:
Figure 3: Transport logs
The following diagram shows exactly what happens on each of the Compute nodes and Storage nodes. The local Rsyslog forwards an event to the local Flume agent (TCP connection to
localhost ). The Flume agent then proceeds to send the logs to HDFS, while buffering them on disk for durability reasons. This happens in the following way:
- An SyslogSource, listening on a TCP socket, receives the incoming rsyslog event
- a “FileChannel” listens for incoming events on the rsyslog TCP source, and writes them locally to 10 different “datadirs” on 10 separate physical hard disk drives. Each datadir acts as a FIFO. Load is balanced evenly from the single Rsyslog TCP source to the 10 datadirs
- The FileChannel is plugged to 4 “HDFS Sinks”. When enough events have been buffered in the channel, those events are sent to the 4 HDFS sinks in an evenly balanced fashion. The HDFS sinks use the local HDFS driver, meaning the connection is local (but that the local data will be replicated to other HDFS nodes accross the network
- When the HDFS driver reports the event have been sync’ed to HDFS, the events are removed from the channel.
Figure 4 : Flume Filechannel
The number of HDFS Sinks and DataDir has been chosen because we empirically observed that was the optimal setup for our Hadoop Nodes.
We have chosen Spark Streaming for the advantages it gives to process in real-time all of the events that arrive from Flume, for the scalability and the fact that it supports the ingest of data from a wide range of sources including live streams from Apache Kafka, Apache Flume, Amazon Kinesis
Regarding the hardware of the HDP cluster, we made the choice on DELL R730 the configuration we have set with 16 Core, 128G RAM and 14 disk with 1To. We attempted to respect the rule of thumb for Hadoop of (1 Disk -> 8G RAM -> 1 physical core) in order to optimize the I/O performances with 10 file channel per machine and we kept 2 disk for the system.
For our ElasticSearch cluster, we choice 5 machines M610 in order to have an odd number for the redundancy and the failover.
After having experimented with the installation of the cluster we realized that the best way to provide the service to the developers was to fully automate the deployment. For this purpose we decided to use Jenkins since it is already widely used by our QA and development team as a continuous integration tool. In order to provide a seamless integration of the Ansible playbook with Jenkins we also decided to create an Ansible plugin for Jenkins ( https://wiki.jenkins-ci.org/display/JENKINS/Ansible+Plugin )
In QA the Jenkins server receives notifications from the git repository when a new version of a component of the cluster or a deployment playbook was released an automatic upgrade of the cluster was triggered in the QA environment. Afterwards the Jenkins server runs a batch of integration tests whose results are immediately published.
Through Jenkins the Ansible playbooks provide:
- Automated provisioning of VMs for test if needed.
- Configuration of Hadoop and ElasticSearch servers (operating system, users accounts…)
- Installation and upgrade of Apache Ambari cluster management application
- Configuration of Hadoop services
- Deployment and upgrade of an ElasticSearch cluster
- Deployment of monitoring probes, server and user interface
We monitor and manage our HDP cluster using Apache Ambari. We use grafana as a metrology interface used to plot the performance data polled by shinken probes and Thruk as a monitoring and alerting interface.
Open Source at Arkena
Arkena supports and likes multiple open source projects. Through my career at Arkena I have, managed to support and contribute to several open source projects .(ffmpeg, Varnish,…). We will continue to contribute some of our current efforts in the same way for example our Jenkins Ansible plugin the Ansible deployment scripts and more.
We continue to improve the clusters, adding new operational tools to make life easier for our operation team. Also we have decided to reconsider our Flume choices, we start looking at Kafka and Hortonworks Data Flow (Apache Nifi) ( https://nifi.apache.org/) .
Selecting Hortonworks as our partner
Starting in late 2014 when comparing Hadoop offerings, we looked for a fully open source option that wouldn’t lock the company in to one vendor or technology. We also wanted to be closely aligned with the open source development community of Hadoop. Finally, Arkena required a fully certified and supported software stack, and Hortonworks’ top-rated support capabilities were important. For these reasons, we selected Hortonworks and the Hortonworks Data Platform and were able to build and roll out the first version of our advanced analytics platform on top of HDP 2.3 in the fall of 2015.
We have a great partnership with Hortonworks and that has helped us be successful with out Hadoop implementation. We’ve been able to reduce our overall costs for storage and processing of data while meeting our goals of providing our users with advanced analytics and insight into their content.
Register for the webinar by clicking on “attend” below: