Saturday, July 15, 2023

Day 3 of #100daysofnetworks

Welcome to day 3 of #100days of networks. 

If you would like to learn more about networks and network analysis, please buy a copy of my book!

Today, we are going to talk about CENTRALITIES. Network Centralities are a useful tool to quickly identify interesting nodes (people, things, etc) from any network. Once you have built a graph, you should use centralities to get a lay of the land, to "learn the main characters", so to say.

In today's exercise, we will use the Les Miserables graph from NetworkX, to keep things simple.

You can use my Github code to follow along.

Here is a bit about centralities:

  • Degree Centrality: Importance based on the number of degrees (edges)
  • Betweenness Centrality: Importance based on whether a node sits between other nodes; Information flows through them. Can also be gatekeepers. They have power.
  • Closeness Centrality: Importance based on a nodes closeness to other nodes. Has to do with number of steps away.
  • PageRank: Importance based on number of inbound and outbound edges. Inbound is more important than outbound.
  • There are many, many, others.

In today's exercise, I will show how to use the above, as well as another algorithm called HITS, which is used to identify hubs (many outbound edges) and authorities (many inbound edges).

These are what I consider "starting centralities", in that I always use them, for any graph, to get a lay of the land, so to speak. However, use will depend on the scale of the network. If you are below million scale, then all of these should work, but betweenness and closeness will gradually slow down to the point of being impractical. If you are above million scale, PageRank will be your go-to algorithm. PageRank was created by the founders of Google and it scales well. Betweenness and Closeness Centrality do not scale well, but they are very useful on smaller networks, or on subsets of networks.

Network Spot Check

Before doing anything with any network, it is useful to do a few spot checks. For instance, it is useful to know the size of a network before choosing algorithms for working with the network. However, I have used this particular network several times and know that it is small and that none of today's algorithms will have any issues with a network of this size, so let's keep going. I'll show you how to get useful network metrics for Network EDA (exploratory data analysis) on another day. 

For today, as this is a small network, let's just visualize it.

This will do nicely. Click on the image to look closer. What insights can we gather simply by LOOKING at the visualization?

  • Valjean is clearly an important character. He sits in a very central position in the network, and there are complex relationships that exist around him, shown by the network structures (clusters) nearby. To get from one side of the network to the other side, paths go through Valjean.
  • Gavroche is another important character. He is well-connected to very dense parts of the network, but more connected than others.
  • Myriel is another important character, as indicated by the different node color.
  • There are several clusters of densely connected nodes. These clusters form interesting communities and should be explored both as a single entity, but also at the single node level. How do individuals in these communities behave? How do the communities as a whole behave compared to the rest of the network? What sets them apart?
  • There are many nodes with a single edge. These are support characters. Napoleon is one. Napoleon is an interesting character in history, but is only linked to a single character in the book.
  • There are no isolates in this network. Isolates are nodes without any edges. They will show in a network visualization as just a dot, orbiting on the outside of a visualization. Network visualization software typically keeps them away from the connected parts of a network.

Nice, so we can see a few things:

  • There are important nodes
  • There are communities (community detection will be useful; we will cover it later)
  • There is only one giant cluster in this network. Often, in real-world networks, there will be several complex and large ecosystems that exist. Because there is only one big structure in this network, our work will be easier.

But today's exercise is about centralities, and this was just the spot check. Let's keep moving. This spot check looks good. We can see the network, and this network is small enough that we can use our own eyes for insights. But let's use algorithms to make our work easier.

I'm going to keep my descriptions as simple as possible. If you are interested in the math or research behind these algorithms, please check the links below, and read the research papers mentioned on NetworkX.

Degree Centrality

Here is more information on Degree Centrality.

Degree Centrality has to do with the number of degrees (edges) that nodes have. For instance, if most nodes in a network have 1-3 edges (lines), and a few have (20-30), then those few nodes are probably very important. Because for some reason, they sure have a lot of connections. 


Here are the top ten characters by Degree Centrality. 

We were able to use our eyes to see that Valjean and Gavroche were two of the most important characters, based on their position in the network, and this shows the same. It also tells us other important characters that we should look into.

Degree Centrality does not care about the direction of edges (lines) from nodes to other nodes. It just has to do with the total count. It is direction agnostic.

Degree Centrality should work well on graphs of any size. I have used them at million scale. It should work at billion scale.

Betweenness Centrality

Here is more information on betweenness centrality.

Betweenness Centrality is actually my favorite centrality, because it has to do with information flow. Let's pretend there are three people (A, B, C).

A - B - C

In order for Person A to share his idea with Person C, he needs to go through Person B.

In the same way, if Person C wants to share her idea with Person A, she will need to go through Person B.

Person B is very important in this situation, as all information must go through that person.

Betweenness Centrality has to do with paths that flow through nodes. For instance, For Person A to talk to Person C, the path goes A -> B -> C. For Person C to talk to person A, the path goes C -> B -> A. In a network with any complexity, there are many paths, and these are used in calculating Betweenness Centrality scores.


These are the top ten characters by Betweenness Centrality. Look at this visualization, and then compare it to degree centrality. Notice how different Valjean's betweenness centrality is from everyone else's. Also notice that Gavroche is not in the second position for Betweenness Centrality. Myriel holds that position. Find Myriel in the full network visualization and try to see why. 

Betweenness Centrality slows down as networks grow in size, as there are more nodes, more edges, and thus, more paths. It is a useful algorithm at thousand scale, but once at million scale, you might want to use PageRank as your primary algorithm for determining node importance. At thousand scale, I like to use Betweenness Centrality AND PageRank as my primary algorithms for importance. I use PageRank at every scale.

Closeness Centrality

Here is more information on closeness centrality.

Closeness Centrality is another algorithm that uses paths in its calculations, like Betweenness Centrality. That means that it suffers from the same problem. It is useful on small networks, at thousand-sale, but it is impractical for million or billion scale.

However, this is a very useful measure. It indicates nodes' closeness to other nodes. Put another way, is a node in the city, or out in the woods? A persons opportunities are partially dependent on environment.

Personally, I always include this measure in my analysis, but it is more for added context, and less useful to me. Other measures give me more valuable insights (such as Betweenness Centrality). But I always include this, for context.


In terms of closeness, Valjean and Marius are in the first two positions, and Valjean stands out compared to the others. The others are pretty similar.

PageRank

Here is more information on PageRank.

PageRank is extremely useful. It was created by the founders of Google and part of how Google search worked. PageRank has to do with inbound and outbound links, so directionality is implied, but it also works well with undirected networks. Our Les Miserables network is undirected. 


PageRank clearly identified Valjean as the most important character, with Myriel in distant second place, and Gavroche in third.

PageRank is a great algorithm for importance. In my experience, it is always useful. Get used to including it in your analysis.

PageRank was created for the internet, which is a BILLION SCALE network. PageRank works well at any scale.

HITS

Here is more information on the HITS algorithm. 

HITS is a very cool algorithm that identifies hubs and authorities. Hubs are nodes that have many outbound links, and authorities are nodes that have many inbound links.

For instance, if a website is linked to by ten thousand websites, then that website is possibly an authority on whatever content they publish. Or, on social media, if an account is retweeted by a million other accounts that account is possibly an authority on whatever they talk about. However, that is in an ideal world. We live in a world of artificial amplification, where bots and blogs provide artificial amplification, but that is for another day.

A hub, on the other hand, has many OUTBOUND links. For instance, there are some cool websites that link to the weirdest news on the internet. One website might link to one thousand websites. One website is sending internet traffic to one thousand parts of the internet. That one website is a HUB.

Authorities: many inbound edges (lines, links).

Hubs: many outbound edges.

Today's network is undirected, and this algorithm needs a directed network to be most useful. However, it will still work with an undirected network, just the two result sets will be identical. 

Here are the two visualizations:



Notice that the two visualizations are identical. If we did this with a directed graph as input, the two visualizations would be distinct and more useful.

This algorithm isn't particularly useful today, but it will be useful with any directed network. And even though it was unable to discern hubs vs authorities, it still identified Gavroche and Valjean as the two most important characters, based on network position.

So, which is better?

Ok, cool. So, we looked at a few choice measures for centrality, but which one is the best? NONE OF THEM ARE THE BEST. They have different uses. Personally, if I have to pick only one, I'll use PageRank, but PageRank doesn't say much about betweenness or closeness. If I'm in a situation where I will only pick one, it's typically a scalability issue. PageRank does well at any scale, and Closeness and Betweenness Centrality do not. Degree Centrality easily scales. Certain algorithms work at even billion scale, and some become unpractical beyond thousand scale.

None of them are better, but you should know several, what they do, and where they do not work well.

My networks are billion scale, so I am usually looking for algorithms that scale well. However, it would be a total rookie mistake to write off any algorithms simply because the network is too large. For instance, you can extract a subset of a massive network and then all algorithms will be useful. And, as I mentioned before, most networks have several large clusters, so networks are commonly analyzed in pieces, anyway.

Final Takeaway

There's something you should keep in mind with regards to using centralities, PageRank, HITS, and other algorithms for determining node importance. Importance is calculated based on a nodes placement in a network. It has to do with position and surroundings, but this is network context.

It is important to think from a network perspective, but do not forget that the network is just the map. Just because two people are connected does not mean they are equally influential. One of them might be stupid and ignored, just tolerated. One of them might be less connected bully, but influential by might. One might be more influential in terms of messaging. And there are always other layers to the network that have not been built into the network. For instance, if a person from one end of the network etched something into a tree and a person from the opposite end of the network read it, then information flowed in another way.

These are tools that are useful from a network perspective, but there is more to the story.

What are you waiting for?

If you find this content interesting, please jump in and give this a try! Install Jupyter or use Google Colab and start exploring. You don't need to know everything on day one. Just get started. Learning to work with networks and explore relationships is powerful, and this skill becomes tremendously useful the deeper you go.

That's Enough for Today

I hope you found this to be an enjoyable read, and I hope my explanations made sense. This blog post was written quickly. If you would like to learn more about networks and network analysis, please buy a copy of my book!

No comments:

Post a Comment

This Blog Has Moved!

This blog has moved to Substack! No more updates will be added to the blogspot blog. I will leave posts here but will not add new ones. New ...