Data preprocessing and perceiving outliers using clustering algorithm on real-time dataset

Data mining is the extraction of interesting, interpretable new knowledge from large amounts of data, and can be considered part of the knowledge discovery process. It works by finding patterns in a data set. Data mining is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information from a data set [12]. A data set is a collection of information organized so that it can be easily accessed, managed, and updated. A data set may contain random errors, called noisy data, which are unnecessary for the data set. These unwanted data are called outliers. Outliers may occur due to incorrect entry, sampling error, misreporting, or an exceptional but true value. Real-world data containing outliers are dirty, inadequate, noisy, and unpredictable, and do not yield quality mining results. Such data have to be cleaned using a method called preprocessing. Preprocessing involves data cleaning, data integration, data transformation, data reduction, and data discretization. Among the preprocessing methods, data cleaning plays a vital role in removing outliers and resolving inconsistencies. Outliers can be removed using WEKA (Waikato Environment for Knowledge Analysis), an open source software package issued under the GNU General Public License [11]. The patterns found by data mining can be described as rules for clustering, classification, summarization, and association. Clustering partitions the data into groups based on the trends and relationships within the data. There are different kinds of clustering algorithms, such as k-means, EM, farthest-first, filtered clustering, hierarchical, and density-based algorithms. In this paper, a real-time data set is preprocessed and outliers are detected using various clustering algorithms, which are applied to the data set using the WEKA tool. The results of the various clustering algorithms are analyzed to determine which algorithm is more convenient for the user and how much time each takes to perform the clustering.
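
The general idea of detecting outliers through clustering can be illustrated with a minimal sketch. This is not the WEKA workflow used in the paper: it is a plain Python k-means, and the function names (`kmeans`, `flag_outliers`), the sample points, and the cutoff rule (mean distance to the assigned centroid plus two standard deviations) are all assumptions chosen only for illustration.

```python
import math
import statistics


def kmeans(points, k, iterations=20):
    """Plain k-means; hypothetical helper, not WEKA's SimpleKMeans."""
    # Deterministic start (an assumption): the first k points are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    assignments = [
        min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
    ]
    return centroids, assignments


def flag_outliers(points, k=2, threshold=2.0):
    """Flag points unusually far from their cluster centroid (assumed rule)."""
    centroids, assignments = kmeans(points, k)
    distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]
    # Assumed cutoff: mean distance plus `threshold` population standard deviations.
    cutoff = statistics.mean(distances) + threshold * statistics.pstdev(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]


# Made-up sample: two tight groups plus one point far from both.
sample = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (0.9, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1), (4.8, 4.9),
    (9.0, 1.0),
]
print(flag_outliers(sample, k=2))
```

A distance-based cutoff like this is only one possible choice; the other algorithms named in the abstract (EM, hierarchical, density-based) would group the data differently and may therefore flag different points.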

Author:

Dr. Rajeswari, J.

