K-Means Clustering

Sgrgyanchandani
Jul 15, 2021

What is Clustering?

Let’s kick things off with a simple example. A bank wants to give credit card offers to its customers. Currently, it looks at the details of each customer and, based on this information, decides which offer should go to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers by income into three groups, such as high, average, and low income.

Can you see where I’m going with this? The bank can now make three different strategies or offers, one for each group. Instead of creating a separate strategy for every individual customer, it only has to make three. This reduces both the effort and the time.

The groups described above are known as clusters, and the process of creating these groups is known as clustering. Formally, we can say that:

Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.

What is meant by the K-means algorithm?

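K-Means clustering is an unsupervised learning algorithm: unlike supervised learning, it needs no labeled data. It divides objects into K clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters. Each cluster is represented by its centroid (the mean of its members), and the algorithm alternates between assigning each object to its nearest centroid and recomputing the centroids until the assignments stop changing.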
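To make the idea concrete, here is a minimal sketch of K-Means in plain Python, applied to the bank example from earlier. The income figures are made up for illustration, and the function names are my own. One assumption to note: the starting centroids are picked with a simple farthest-first rule so the result is repeatable, whereas library implementations usually start from random or k-means++ seeds.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start with the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iters=100):
    """Return (centroids, labels) after alternating assign and update steps."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical annual incomes (in thousands) for nine customers.
incomes = [(22,), (25,), (27,), (58,), (61,), (64,), (140,), (150,), (155,)]
centroids, labels = k_means(incomes, k=3)

for c, centroid in enumerate(centroids):
    members = [p[0] for p, lab in zip(incomes, labels) if lab == c]
    print(f"cluster {c}: centre {centroid[0]:.1f}, members {members}")
```

With this toy data the three clusters line up with the low, middle, and high incomes. On real data the result depends on the starting centroids and on the choice of K, so it is common to try several starts and compare.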

Use Case: Identify Outlier Access

Clustering and K-Means can be used for traditional role mining, that is, cleaning up access by providing additional visibility into the access that is actually being used. The average user has more than 100 entitlements, which is very difficult to manage manually. With a Clustering and K-Means machine learning model, we can detect access outliers by analyzing what’s going on within dynamic peer groups of users.
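As a rough sketch of what detecting access outliers within a peer group could look like, the snippet below takes one group of users (assumed to have already come out of a clustering step) and flags entitlements that few of a user's peers hold. The user names, entitlement names, and the 0.5 threshold are all hypothetical, and this is not a description of any particular product's method.

```python
# Hypothetical peer group: each user maps to the set of entitlements they hold.
peer_group = {
    "alice": {"email", "wiki", "ledger_read"},
    "bob": {"email", "wiki", "ledger_read"},
    "carol": {"email", "wiki", "ledger_read", "prod_db_admin"},
    "dave": {"email", "wiki"},
}


def entitlement_centroid(group):
    """Share of the group holding each entitlement.

    This is the centroid of the users' 0/1 entitlement vectors."""
    all_entitlements = sorted(set().union(*group.values()))
    size = len(group)
    return {
        e: sum(e in held for held in group.values()) / size
        for e in all_entitlements
    }


def outlier_access(group, min_share=0.5):
    """Return {user: [entitlements held by fewer than min_share of peers]}."""
    centroid = entitlement_centroid(group)
    flagged = {}
    for user, held in group.items():
        rare = sorted(e for e in held if centroid[e] < min_share)
        if rare:
            flagged[user] = rare
    return flagged


print(outlier_access(peer_group))
```

Here only carol is flagged, because she alone holds `prod_db_admin`. A flag like this is a prompt for review, not proof of a problem: the right threshold depends on the group size and on how sensitive the entitlement is.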

Let’s look at an example.

On a lovely Saturday afternoon, the company’s access data shows an employee from IT working on your production finance system. This looks like outlier activity: it’s not typical for someone in an IT role to access a production finance system, much less on a Saturday afternoon. So, is this risky activity? At the same time on the same day, a business analyst is also accessing and working on that same production finance application.

If we examine these two access activities individually, we might perceive a problem. Yet, if we combine these two access data points dynamically, the situation may appear to be less risky.

Now, let’s add a third person, a financial analyst from the Finance organization, who is also accessing the same production finance application on the same Saturday. We now have three people from different work groups, all accessing the production finance system at the same time on the same day. So, what’s going on?

What’s most likely taking place in this scenario is that these employees are working together to perform a system upgrade or to resolve a production issue in the financial system. Judged by traditional static attributes such as job title or department number, these three employees would not be considered a relevant peer group. From a behavioral analytics standpoint, however, they do form a dynamically generated peer group, because system logs show them accessing the same production finance system at the same time.

Dynamic peer groups are clusters of users that the machine learning algorithms create internally, in near real time, as Gurucul Risk Analytics ingests log data. Dynamic peer groups are fairly transient, yet they can be retained for future reference.
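To illustrate the idea of a peer group built from behavior instead of job title, here is a deliberately simple sketch. It does not use K-Means, and it is not how Gurucul Risk Analytics works internally: it just groups users who touched the same system within the same fixed time window. The log records, user names, and the 60-minute window are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical access log: (user, department, system, timestamp).
# 10 July 2021 was a Saturday.
events = [
    ("it_admin", "IT", "prod_finance", datetime(2021, 7, 10, 14, 5)),
    ("biz_analyst", "Business", "prod_finance", datetime(2021, 7, 10, 14, 20)),
    ("fin_analyst", "Finance", "prod_finance", datetime(2021, 7, 10, 14, 40)),
    ("dev_1", "Engineering", "build_server", datetime(2021, 7, 10, 9, 15)),
]


def dynamic_peer_groups(events, window_minutes=60):
    """Group users who accessed the same system in the same time window.

    The department field is ignored on purpose: the grouping is based on
    behavior only, not on static attributes."""
    groups = defaultdict(set)
    for user, _department, system, ts in events:
        minutes_into_day = ts.hour * 60 + ts.minute
        window = (ts.date(), minutes_into_day // window_minutes)
        groups[(system, window)].add(user)
    # A window shared by two or more users forms a peer group.
    return [sorted(users) for users in groups.values() if len(users) > 1]


print(dynamic_peer_groups(events))
```

The three finance-system users from the story end up in one group even though they belong to three different departments. Fixed windows are a shortcut: two accesses a minute apart can land in different windows, which is one reason a real system would cluster on richer features than a time bucket.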

What are the Benefits of Clustering and K-Means?

The benefits of the Clustering and K-Means machine learning model in Risk Analytics are numerous. A key one is the ability to flag questionable access so that it can be remediated or revoked. We know that most user identities are over-provisioned, and if those identities are compromised, tremendous damage can occur.

Clustering and K-Means help to refine, resolve and reduce false positives. By combining Clustering and K-Means machine learning with dynamic peer grouping technology, Risk Analytics can reduce false positives by a factor of 10 compared with static groups from directories like Active Directory.

Thank you for reading,

I hope you liked it…

If you want to connect with me, here’s my LinkedIn profile:

https://www.linkedin.com/in/mr-sagar-gyanchandani/
