Outlier Detection From Large Scale Categorical Breast Cancer Datasets Using Spark 2.0.0: Part I

Datetime: 2016-08-23 04:15:44          Topic: Spark, Data Mining

Outlier Detection: What Is It All About and Why Is It Important?

Outlier detection is the process of finding instances whose behavior is unusual relative to the pattern that otherwise holds in a given system. Effective detection of outliers can therefore uncover non-trivial information in a dataset.

In the last decade, outlier mining has received significant attention and has become an important research direction in academia as well as in industry. This is because of its wide acceptance and its applications in numerous domains, such as detecting fraudulent credit card usage in the banking sector, unauthorized access in computer networks, and biomedical data analytics [1, 2].

Many efficient approaches have been proposed for detecting outliers in numerical data. For categorical datasets, however, only a few approaches [3, 4] have been published. The task also becomes more tedious when handling very large and complex datasets (i.e., datasets with multidimensional and unstructured contents). The reason is that, for numerical data, a naive rule can flag any data point lying more than 3*IQR (interquartile range) beyond the quartiles as an outlier, whereas categorical data offers no comparable measurement, as far as I understand. Therefore, an efficient, scalable, and robust measure is badly needed for mining outliers.
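To make the naive numerical rule concrete, here is a minimal Java sketch. It rests on assumptions that are not in the article: the class and method names are hypothetical, quartiles are computed by linear interpolation (one of several common conventions), and "3*IQR" is read as a fence placed 3*IQR below the first quartile and above the third quartile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the naive IQR rule for numerical data.
class IqrOutliers {

    // Quantile of an already sorted, non-empty array, using linear
    // interpolation between the two closest ranks (an assumed convention).
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns every value that lies more than k * IQR outside the quartiles.
    static List<Double> outliers(double[] values, double k) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - k * iqr;
        double upper = q3 + k * iqr;

        List<Double> result = new ArrayList<>();
        for (double v : values) {
            if (v < lower || v > upper) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample: seven similar readings and one extreme value.
        double[] data = {10, 12, 11, 13, 12, 11, 10, 95};
        System.out.println(outliers(data, 3.0));
    }
}
```

The rule depends on ordering and subtracting values, which is exactly what categorical attributes do not support.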

Let's look at a very simple example. Suppose you offer 1000 Apples and 1000 Oranges to 1000 people and ask each of them to choose either an Apple or an Orange. In the end, you find that 999 people chose an Orange and only one person went for an Apple. In this scenario, we can say that the person who chose the Apple is an outlier. With numerical data, we would use a measurement to detect such anomalies. With categorical data, we instead need a way to explain why choosing an Apple should be treated as an anomaly: that data point does not behave like the other 99.9% of the population.
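The Apple and Orange scenario can be sketched in a few lines of plain Java. This is only an illustration of the frequency idea, not the method developed in this series: the class name, the method names, and the 1% threshold are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: flags categories chosen by very few people.
class RareCategoryOutliers {

    // Share of the population that picked each category.
    static Map<String, Double> relativeFrequencies(List<String> choices) {
        Map<String, Double> freq = new TreeMap<>();
        for (String c : choices) {
            freq.merge(c, 1.0, Double::sum);       // count occurrences
        }
        freq.replaceAll((category, count) -> count / choices.size());
        return freq;
    }

    // Categories whose share falls below the given threshold (assumed cut-off).
    static List<String> rareCategories(List<String> choices, double threshold) {
        List<String> rare = new ArrayList<>();
        for (Map.Entry<String, Double> e : relativeFrequencies(choices).entrySet()) {
            if (e.getValue() < threshold) {
                rare.add(e.getKey());
            }
        }
        return rare;
    }

    public static void main(String[] args) {
        // 999 people choose an Orange, one person chooses an Apple.
        List<String> choices = new ArrayList<>(Collections.nCopies(999, "Orange"));
        choices.add("Apple");

        System.out.println(relativeFrequencies(choices));
        System.out.println(rareCategories(choices, 0.01));
    }
}
```

With a single attribute this is trivial. The difficulty discussed in the rest of the article comes from doing something comparable across many categorical attributes and very large datasets.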

The above example is deliberately simple. What happens, however, when someone has to deal with a large-scale, complex, and multidimensional dataset at petabyte or exabyte scale? In practice, many research areas have entered the Big Data era, since datasets in these domains are being generated at an unprecedented rate. Biomedical data analytics is no exception; it is now certainly a Big Data area that fulfills the 5V Big Data criteria (i.e., Volume, Velocity, Variety, Veracity, and Value). As a result, finding the Value for cancer diagnosis and prognosis in such large-scale biomedical datasets is an emerging research requirement.

State of the Art and Motivations          

Several initiatives have been taken to make outlier detection scalable and faster [1, 4]. Among them, the MR-AVF [1] algorithm was implemented on the Hadoop-based MapReduce framework. However, this algorithm has several issues: I/O overhead, algorithmic complexity, batch-processing jobs that are not low-latency, and fully disk-based operation. In [4], the authors proposed two 1-parameter outlier detection methods, ITB-SS and ITB-SP, which are not scalable either. Other notable works include [2, 3, 5], which are suitable for outlier detection in distributed datasets with mixed-type attributes, but for in-memory processing only.

In contrast, Apache Spark is an in-memory cluster computing framework that allows user programs to load data into a cluster's memory and query it repeatedly, which makes it well suited to machine learning algorithms. Spark tries to cache intermediate data in memory and provides the abstraction of Resilient Distributed Datasets (RDDs), which can be used to overcome these issues. Over the last few years, Spark has achieved tremendous success in handling Big Data in areas such as drug discovery, RDMA, biological sequence alignment in distributed computing systems, statistical analysis for network anomaly detection, historical data, and semantic analysis. With the increasing demand to discover and explore data for real-time insights, the need to extend MapReduce became apparent, and this led to the emergence of Spark. These facts and successes have motivated me to explore other areas, such as biomedical data analytics, for applying Apache Spark based Big Data analytics.

Therefore, in this article, I will show how to detect outliers in a large-scale categorical cancer dataset, with a view to cancer diagnosis, using Spark 2.0.0 and Java. The technical implementation will use the newly released Spark 2.0.0, which is smarter, faster, and lighter.

Wisconsin Breast Cancer Dataset

In this section, I will describe the data collection procedure, give a brief description of the dataset, and share some tips.

Dataset Collection

The Cancer Genome Atlas (TCGA), the Catalogue of Somatic Mutations in Cancer (COSMIC), and the International Cancer Genome Consortium (ICGC) are the most widely used sources of cancer and tumor-related datasets, curated by MIT, Harvard, and some other institutes. However, these datasets are largely unstructured; therefore, for brevity, I will not use them directly to show how to apply large-scale machine learning techniques on top of them. Instead, we will use simpler datasets that are structured and manually curated for machine learning application development, many of which also show good classification accuracy. One example is the Wisconsin Breast Cancer datasets from the UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of a fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital images.

Dataset Description and Exploration

The dataset was downloaded from the UCI Machine Learning Repository [6]. According to the dataset description there, the dataset includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements, also called bioassays. The diagnosis is coded as M for malignant or B for benign. The class distribution is as follows: Benign: 357 (62.7%) and Malignant: 212 (37.3%). Following this labeling, we will prepare our training and test datasets accordingly. The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei. These include:

•    Radius

•    Texture

•    Perimeter

•    Area

•    Smoothness

•    Compactness

•    Concavity

•    Concave points

•    Symmetry

•    Fractal dimension
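The record layout described above (an ID, a diagnosis, and 10 characteristics times 3 statistics) can be sketched in Java as follows. Several things here are assumptions rather than facts from the article: the class name, the generated column names, the comma-separated format, and the ordering of the 30 values (all means first, then all standard errors, then all worst values). The sample line in main is synthetic and is not taken from the real dataset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type for one line of the breast cancer data file.
class WdbcRecord {
    static final String[] CHARACTERISTICS = {
        "radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"
    };
    static final String[] STATISTICS = {"mean", "se", "worst"};

    final String id;
    final boolean malignant;
    final double[] features;   // 30 numeric measurements

    private WdbcRecord(String id, boolean malignant, double[] features) {
        this.id = id;
        this.malignant = malignant;
        this.features = features;
    }

    // Builds the 30 column names, e.g. radius_mean ... fractal_dimension_worst.
    static List<String> featureNames() {
        List<String> names = new ArrayList<>();
        for (String stat : STATISTICS) {
            for (String characteristic : CHARACTERISTICS) {
                names.add(characteristic + "_" + stat);
            }
        }
        return names;
    }

    // Parses "id,diagnosis,f1,...,f30" (assumed comma-separated layout).
    static WdbcRecord parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 32) {
            throw new IllegalArgumentException("expected 32 fields, got " + fields.length);
        }
        String diagnosis = fields[1].trim();
        if (!diagnosis.equals("M") && !diagnosis.equals("B")) {
            throw new IllegalArgumentException("diagnosis must be M or B: " + diagnosis);
        }
        double[] features = new double[30];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(fields[i + 2].trim());
        }
        return new WdbcRecord(fields[0].trim(), diagnosis.equals("M"), features);
    }

    public static void main(String[] args) {
        // Synthetic line: made-up ID, diagnosis M, and placeholder values 0.0 to 14.5.
        StringBuilder line = new StringBuilder("0001,M");
        for (int i = 0; i < 30; i++) {
            line.append(',').append(i * 0.5);
        }
        WdbcRecord record = parse(line.toString());
        System.out.println(record.id + " malignant=" + record.malignant);
        System.out.println(featureNames().get(0) + " = " + record.features[0]);
    }
}
```

Validating the field count and the diagnosis code up front makes malformed lines fail early instead of silently producing wrong feature vectors.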

Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process. Here is a snapshot of the above dataset:

[Figure: snapshot of the dataset]


Interested readers can find more insights about the Wisconsin breast cancer data in the following publication: W.N. Street, W.H. Wolberg, and O.L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pp. 861-870, 1993.

Be sure to check out Part II where we'll look at the actual steps you will need to perform!
