Great minds think alike
(How can great minds like Shakespeare and Douglas Adams help us improve YARN)
Let me begin with a great quote from Douglas Adams – “ A common mistake people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools .”
In 2005, Doug Cutting, a Yahoo employee with a liking for cute baby elephants created Hadoop. Hadoop was created as a framework for processing massive amounts of data (big data, if you would like) on distributed storage and distributed computing platform. Hadoop was itself influenced by the legendary paper on MapReduce which is a programming model underpinning the Hadoop platform.
YARN (Yet another resource negotiator) came about a few years later as a natural progression to original Hadoop framework with loads of improvements and architectural changes. Unfortunately, along came people like me, whom Douglas Adams so aptly warned about!
My first attempts at understanding YARN were not encouraging. Anyhow, most of us were so busy assuaging our wounds post Great Recession era (2008-09) that we missed the intricacies of this great framework. My (re)attempt at taming the baby elephants came about a few years later. Yes, elephants need to be tamed too. I don’t think any of us are zoologists to disagree anyway. I needed to use YARN as the cluster for hosting a lot of our applications. This seemed natural because by now YARN was maturing fast and had become a standard framework for distributed applications.
Users submit their applications to YARN which launches them on nodes that it manages. Seems simple enough? Not really. Shakespeare was right when he said that there can be many a slip between the cup and the lip. It is possible that the application is kept waiting by YARN before it can be assigned any resources in the cluster. In an oversubscribed cluster, this becomes very prominent. While, this is understandable, YARN does not intimate the user about what may be the reason for application to have been accepted but not launched. This makes it difficult to assess the real problem. This is different than Mesos, a global resource manager for entire data center. Mesos offers guarantees back to the application. The application is free to reject those offers and not be at the mercy of framework. This is a better option. Application can also receive multiple offers, and make a delayed decision. Of course, I hasten to add that I am not a Mesos committer or a contributor.
Memory leak = revenue leak
My next yearning begins with a question. What happens when an application leaks memory? Well, what will happen is that the application will be terminated by YARN, meaning this is a stupid question. Ok, so here is what I mean. YARN allows controlling both physical and virtual memory usage of applications by enabling checks ( yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled ). Once an application attempts to go beyond its share of allocated memory it will be terminated. The problem here is that this setting applies at global level. Why would anyone not want to terminate a process which leaks memory? Good question. But it is not always a case of memory leak. Sometime, a process may have exhausted more than its share of memory because the user underestimated its requirement. Spark is a great example of this and anyone who has run spark jobs on YARN would have faced this problem –
“ Container killed by YARN for exceeding memory limits. 10.0 GB of 10 GB physical memory used. ”
The penalty “kick”
The penalty for very long running applications can be disastrous because application needs to be re-run. YARN can allow administrators to selectively relax memory requirement check for applications they just do not want to be terminated. I would also agree that proper sizing is a must and can prevent this issue too.
There is no easy way to know why an application may have been terminated. Differences in implementation by leading vendors such as Cloudera and Hortonworks also make it nontrivial. The list of reasons as to why an application can be terminated can be huge and YARN must absolutely provide some API (REST) through which user can know of the exact failure reason. The advantage of such an API is that the problem can be programmatically ‘examined’, ‘corrected’ and ‘resubmitted’ leading to a happier world. YARN maintains certain fields ( diagnostics & finalStatus ) which can be enhanced to provide such details.
Health is wealth
A similar point applies when YARN marks a cluster node as unhealthy and terminates all applications on it. One such reason could be a particular node having exhausted most of its disk capacity.
“ Node irl66dsg04:8041 reported UNHEALTHY with details: 1/1 log-dirs are bad: /var/log/hadoop-yarn/container ”
YARN uses the properties yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage and yarn.nodemanager.disk-health-checker.min-healthy-disks to determine when it considers a node as unhealthy. Once a node is marked as unhealthy no applications can be scheduled on it unless the problem is addressed. Applications on this node get terminated but no actual reason is given back to the end user in their application container logs –
“ Container Completion for containerID=container_1446277450132_0002_01_000002, state=COMPLETE, exitStatus=-105, diagnostics= Container killed by the ApplicationMaster. Container killed on request. ”
The reason for application container termination is not helpful enough. The exact reason (in this case disk space issue) must be made more explicit. In addition, sifting through multiple logs to determine the exact reason (as given above) is indeed difficult and YARN must provide a means to bubble up these errors to user directly.
Aggregate your data
Last but not the least, YARN allows for aggregating logs to its distributed file system HDFS. This is a very important feature and deserves a special mention. YARN also allows periodically uploading files to HDFS for very long running applications. The problem here is that like many other properties, this property also applies to all applications. This means if you have thousands of applications in your cluster the moment you turn on log aggregation all applications log files will start contributing to HDFS footprint. Keep in mind the replication factor (property dfs.replication ) of 3 which applies to HDFS by default and will soon be required to give some explanation to our enraged storage administrators.
To close this post I’d say my (re)attempts have been successful but then the more we learn, the less we know and this blog is a summary of what needs to be done more. Those are good problems to solve and for sure they are not trivial because as that saying goes – a problem worth attacking proves its worth by attacking back .