What is K-Means?
K-Means clustering is an unsupervised learning algorithm: unlike supervised learning, it works without labeled data. It divides objects into clusters so that objects in the same cluster are similar to one another and dissimilar to objects in other clusters.
‘K’ is the number of clusters, and you must specify it up front. For example, K = 2 means two clusters. There are ways to estimate a good value of K for a given dataset.
For a better understanding of K-Means, let's take an example from cricket. Imagine you received data on a large number of cricket players from all over the world, giving the runs scored and the wickets taken by each player in the last ten matches. Based on this information, we need to group the players into two clusters, namely batsmen and bowlers.
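One common heuristic for estimating K (an assumption here, since the text does not name a method) is the elbow method: run K-Means for several values of K, record the within-cluster sum of squares (WCSS), and pick the K after which the improvement flattens out. Below is a minimal pure-Python sketch on made-up one-dimensional data. The helper names `kmeans_1d` and `wcss` and the evenly spaced starting centroids are illustrative choices, not part of any particular library.

```python
def kmeans_1d(values, k, max_iter=100):
    """Tiny 1-D K-Means with a deterministic start (illustrative helper)."""
    values = sorted(values)
    n = len(values)
    # Assumption: start centroids at evenly spaced positions in the sorted data.
    centroids = [float(values[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            sum(members) / len(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    return sum(
        (v - c) ** 2
        for c, members in zip(centroids, clusters)
        for v in members
    )


# Made-up data with three obvious groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]

wcss_by_k = {k: wcss(*kmeans_1d(data, k)) for k in range(1, 5)}
for k, score in wcss_by_k.items():
    print(f"K={k}: WCSS={score:.2f}")
```

On this data the WCSS falls steeply up to K = 3 and barely improves at K = 4, so the "elbow" suggests three clusters.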
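The cricket example can be sketched in a few lines of pure Python. The player figures below are made up for illustration, and the `kmeans` helper and the choice of starting centroids are assumptions of this sketch rather than a reference implementation.

```python
import math


def kmeans(points, k, initial_centroids, max_iter=100):
    """Minimal K-Means: alternate assignment and update until labels settle."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical (runs, wickets) totals over the last ten matches.
players = [(420, 1), (385, 0), (510, 2), (60, 18), (45, 22), (90, 15)]

# Assumption: seed one centroid with a high scorer and one with a wicket-taker.
labels, centroids = kmeans(players, 2, [players[0], players[3]])

for player, label in zip(players, labels):
    role = "batsman" if label == 0 else "bowler"
    print(player, "->", role)
```

Runs are numerically much larger than wickets, so they dominate the distance here; on real data it would usually make sense to bring both attributes onto a comparable scale before clustering.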
Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life applications and business cases, such as:
- Academic performance
- Diagnostic systems
- Search engines
- Wireless sensor networks
- Customer segmentation
- Identifying crime localities
- Academic performance :- Based on their scores, students are categorized into grades like A, B, or C.
- Diagnostic systems :- The medical profession uses K-Means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
- Search engines :- Clustering forms a backbone of search engines. When a search is performed, the results need to be grouped, and search engines very often use clustering to do this.
- Wireless sensor networks :- The clustering algorithm finds the cluster heads, each of which collects all the data in its respective cluster.
- Identifying crime localities :- With data on crimes in specific localities of a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality. Here is an interesting paper based on crime data from Delhi FIRs.
- Customer segmentation :- Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. Here is a white paper on how telecom providers can cluster pre-paid customers to identify patterns in the money spent on recharging, sending SMS, and browsing the internet. This grouping would help the company target specific clusters of customers with specific campaigns.
K-Means use case in Security domain — Identifying crime localities
Criminal activities are a major cause for concern for law enforcement officials. Existing strategies to control crime are usually reactive, responding to the crime scene after the crimes have occurred. However, with the advent of technology and data analytics, it is now possible to recognize patterns in criminal activities using historical data and help law enforcement officers do a better job in crime prevention and control.
Steps of crime pattern analysis
Step 1 :-
Determine the geospatial plot of crimes in the city:
The first step is the collection of crime information in a given city. This is usually available from multiple sources, such as law enforcement reports, victimization surveys, and collated newspaper articles. This data can be plotted on a geographical map such as the one shown above.
Step 2 :-
Use clustering techniques to identify patterns:
Clustering divides a dataset into subsets called clusters so that the observations in the same cluster are similar to one another in some meaningful way. It is a method of unsupervised learning and is used for statistical data analysis.
The K-means data mining approach helps us identify patterns, because it is very difficult for humans to process large amounts of data, especially when some of the information is missing.
Clusters are useful in identifying a crime spree committed by a single suspect or by the same group of suspects. These clusters are then presented to the detectives, who drill down using their domain expertise to solve the cases.
K-means clustering is one of the methods of cluster analysis. In the K-means algorithm, each point is assigned to the cluster whose centroid is the closest. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It can be applied to relatively large sets of data.
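Stated as a formula, the partition that K-means looks for minimizes the within-cluster sum of squared distances. The symbols are notation introduced here for illustration: S_1, ..., S_k are the clusters and \mu_i is the mean (centroid) of cluster S_i.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

The usual iterative procedure alternates between assigning each observation to its nearest centroid and recomputing each centroid as the mean of its assigned observations. It stops when the assignments no longer change, which generally yields a local rather than a guaranteed global minimum.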
Use the following steps for cluster analysis:
- Sort the records. The first sort is done on the most important characteristics, based on the detective’s experience.
- Use data mining to detect more complex patterns, since in real life many attributes are associated with a crime and often only partial information is available.
- Identification of significant attributes for clustering.
- Placing different weights on different attributes dynamically based on the crime types being clustered.
- Cluster the dataset for crime patterns and present the results to the detective or the domain expert along with the statistics of the important attributes.
- The detective looks at the clusters and gives recommendations.
- Unsolved crimes are clustered based on significant attributes, and the result is given to the detective for inspection.
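The attribute-weighting step in the list above can be illustrated with a small sketch. One simple way to weight attributes (an assumption here, not necessarily what any given tool does) is to scale each attribute by the square root of its weight, so that ordinary Euclidean distance on the scaled records equals a weighted distance on the originals. The attribute names, weights, and crime types below are hypothetical.

```python
import math

# Hypothetical per-crime-type weights for three numeric attributes,
# each already normalised to the range 0..1:
# (time_of_day, distance_from_city_centre, value_stolen)
WEIGHTS_BY_CRIME_TYPE = {
    "default": (1.0, 1.0, 1.0),
    "burglary": (1.0, 4.0, 1.0),  # location matters more for this type
}


def apply_weights(record, weights):
    """Scale each attribute by sqrt(weight) so that plain Euclidean distance
    on the scaled records equals the weighted distance on the originals."""
    return tuple(math.sqrt(w) * x for x, w in zip(record, weights))


def nearest_centroid(record, centroids):
    """Return the index of the centroid closest to the record."""
    return min(range(len(centroids)), key=lambda j: math.dist(record, centroids[j]))


record = (0.8, 0.3, 0.5)
centroids = [(0.2, 0.3, 0.5), (0.8, 0.8, 0.5)]

# Unweighted: the record is closer to the second centroid.
unweighted = nearest_centroid(record, centroids)

# Weighted for burglary: the location gap is amplified, so the first wins.
w = WEIGHTS_BY_CRIME_TYPE["burglary"]
weighted = nearest_centroid(
    apply_weights(record, w), [apply_weights(c, w) for c in centroids]
)

print("unweighted cluster:", unweighted)
print("weighted cluster:", weighted)
```

The same record lands in a different cluster once the weights change, which is why the choice of weights is best made together with the detective or domain expert.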
Step 3 :-
Analysing patterns and drawing conclusions:
This involves the analysis of each cluster formed. The algorithm cannot explain what is unique about each cluster, and this is where human expertise comes into play. For example, all the crimes shown in red may have been committed using a similar gun, or all the crimes shown in blue may be jewellery thefts in which the victims were walking on the road and the assailants were travelling on a motorbike. This helps to find crime patterns and trend correlations. Once a specific pattern is detected, law enforcement officers can deploy additional and suitable resources for the detection and suppression of criminal activities.
Advantages of clustering for crime pattern analysis
- Analyse historical crime rates and improve the current crime resolution rate.
- Take actions to prevent future incidents by using preventive mechanisms based on observed patterns.
- Reduce the training time of officers who are assigned to a new location and have no prior knowledge of site-specific crimes.
- Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.
Limitations of crime pattern detection
- Crime pattern analysis can only help the detectives and not replace them. It is up to the human experts to interpret what the clusters are telling us.
- Data mining is sensitive to the quality of the input data, which can sometimes be inaccurate. Missing information can also cause errors.
- Mapping data mining attributes is a difficult task, so it requires a skilled data miner and a crime data analyst with good domain knowledge.
Example using K-means on Sigma Magic software
Here is a K-means analysis of crimes in India. This example uses randomly generated data to illustrate the concepts and has no correlation with real data. The data includes the places, the numbers of murders, thefts, and cybercrimes, and the percentage of the population living in urban areas. The number of clusters K is 4, and it took 3 iterations to arrive at the final clusters.