Instagram Community Detection With Machine Learning

James Phoenix

Finding social communities can help you easily identify sub-topical trends within a niche.

As businesses continue to use social media, we can harness the power of graph theory to better understand niches, communities and groups of individuals ❤

In this guide you’ll use an Instagram Scraper to create a mini social graph database via NetworkX.

By monitoring the co-occurrences of hashtags, we can create a graph containing nodes and edges (links) which will then be clustered via an unsupervised machine learning technique called community detection.
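To make the idea of hashtag co-occurrence concrete, here is a minimal sketch that counts how often two hashtags appear in the same post. It uses only the Python standard library, and the function name `cooccurrence_edges` and the sample posts are hypothetical; the repository may build its edges differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(posts):
    """Count how often each pair of hashtags appears in the same post."""
    edge_counts = Counter()
    for hashtags in posts:
        # De-duplicate and sort so ("a", "b") and ("b", "a") count as one edge
        unique_tags = sorted(set(tag.lower() for tag in hashtags))
        edge_counts.update(combinations(unique_tags, 2))
    return edge_counts


# Hypothetical sample data: one list of hashtags per post
posts = [
    ["fashion", "style", "ootd"],
    ["fashion", "style"],
    ["style", "vintage"],
]

edges = cooccurrence_edges(posts)
for (source, target), weight in edges.most_common():
    print(source, target, weight)
```

Each pair becomes an edge and its count becomes the edge weight. Converted to (source, target, weight) triples, the result could be loaded into a NetworkX graph with `Graph.add_weighted_edges_from`.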

Pre-requisites:

You will need to have a working version of Google Chrome.
You will need to have installed Anaconda & Jupyter Notebook.

Then clone this repository, which contains the InstagramScraper and InstagramGraph Jupyter Notebook files:


git clone https://github.com/kitsamho/Instagram_Scraper_Graph

Install Chromedriver for Selenium

Now you’ll need to download Google ChromeDriver. It’s worth noting that the ChromeDriver version must match your Google Chrome version.
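As a quick sanity check before running the scraper, the two version strings can be compared. The sketch below assumes that matching the major version number is what matters, which is a simplification, and the helper names and version strings are hypothetical.

```python
def major_version(version):
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_compatible(chrome_version, driver_version):
    """Assumption: Chrome and ChromeDriver work together when majors match."""
    return major_version(chrome_version) == major_version(driver_version)


# Hypothetical version strings, e.g. copied from chrome://version and
# from the output of `chromedriver --version`
print(versions_compatible("114.0.5735.198", "114.0.5735.90"))
print(versions_compatible("114.0.5735.198", "113.0.5672.63"))
```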

Installing All Of The Other Python Packages

Assuming that you’ve installed Anaconda, your missing dependencies will be the following. Note that re, urllib and getpass are part of the Python standard library, so they do not need to be installed with pip:

pip install selenium
pip install regex
pip install emoji
pip install seaborn
pip install spacy
pip install langdetect
pip install plotly
python -m spacy download en_core_web_lg
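To confirm that the installs worked, a short check like the one below can be run in a notebook cell. This is a sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical, and the list assumes each package's import name is the same as its pip name.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names for the third-party packages listed above (top-level names only)
required = ["selenium", "regex", "emoji", "seaborn",
            "spacy", "langdetect", "plotly"]

print("Missing:", missing_modules(required))
```

An empty list means every package can be imported.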

Change Your Selenium ChromeDriver Files

After downloading the relevant ChromeDriver file, place it within the cloned repository inside of a folder called Chromedrivers.

Then open up the InstagramScraper.ipynb notebook inside of a Jupyter Notebook session. You can run a Jupyter Notebook server with the following command on terminal / command prompt:


jupyter notebook

Additionally, we will have to update the Python class InstagramScraper(). Make sure to set the driver_loc location to ‘/chromedrivers/chromedriver’:


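class InstagramScraper():
    # Point driver_loc at the ChromeDriver binary you downloaded
    def __init__(self, driver_loc='/chromedrivers/chromedriver'):
        ...  # the rest of the constructor is unchanged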

Implementation

Run The IG Crawler With The InstagramScraper() Class

As described inside of the README.md file:

This class is made up of a series of methods that allow for the scraping of Instagram post data. The pipeline consists of three main methods that need to be called sequentially. There is no current method to chain the whole pipeline.

self.logIn() : user detail capture, WebDriver initialisation, Instagram log in.

self.getLinks() : gets n unique links containing <#HASHTAG> using WebDriver.

self.getData() : implements multi-threaded scraping of data from self.getLinks using a combination of Selenium WebDriver and Beautiful Soup. Method returns a pandas DataFrame.

Here’s an example of how to run the scraper:

# Initialise the scraper
scraper = InstagramScraper()

# Log in to Instagram
scraper.logIn()

# Collect the post links for the hashtag
scraper.getLinks()

# Scrape the data from the collected links
df = scraper.getData()

# This is the social media graph data that we will use for the network analysis
df.to_csv('social_media_graph_data.csv')
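Since the README notes that no method chains the whole pipeline, a small wrapper can do it. The sketch below is not part of the repository: `run_pipeline` and `output_path` are hypothetical names, and `FakeScraper` is a stand-in that only demonstrates the call order without opening a browser.

```python
def run_pipeline(scraper, output_path=None):
    """Call the three scraper methods in the order the README describes.

    `run_pipeline` and `output_path` are hypothetical names, not part of
    the repository.
    """
    scraper.logIn()
    scraper.getLinks()
    data = scraper.getData()
    if output_path is not None:
        # Assumes getData() returns a pandas DataFrame, as the README states
        data.to_csv(output_path)
    return data


class FakeScraper:
    """Stand-in used only to show the call order without a browser."""

    def __init__(self):
        self.calls = []

    def logIn(self):
        self.calls.append("logIn")

    def getLinks(self):
        self.calls.append("getLinks")

    def getData(self):
        self.calls.append("getData")
        return ["post-1", "post-2"]


fake = FakeScraper()
result = run_pipeline(fake)
print(fake.calls)
```

With the real class, the call would look like `df = run_pipeline(InstagramScraper(), 'social_media_graph_data.csv')`.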

Run The IG Graph Creation Code via The InstagramGraph() Class

After web scraping a specific hashtag on Instagram, you can use the following code to inspect your data via Plotly.

self.getFeatures(translate=False): creates various descriptive metrics from the data.

self.selectData(english=True,remove_verified=True,max_posts=3,lemma=True): subsets the data across various variables.

self.buildGraph(additional_stopwords=[],min_frequency=5): generates edges and nodes and adds them to an instance of a NetworkX graph object.

self.plotGraph(sizing=75,node_size='adjacency_frequency',layout=nx.kamada_kawai_layout,light_theme=True,colorscale='Viridis',community_plot=False)

self.plotCommunity(colorscale=False): creates a sunburst plot of communities and contributing hashtags

self.savePlot(plot='map'): saves plots as HTML to local directory. Use ‘community’ if community sunburst plot needs to be saved

self.saveTables(): saves all csv files to local directory – node DataFrame, edge DataFrame, initial processed DataFrame and selected DataFrame

Here’s an example of how to use the code:

test = InstagramGraph('social_media_graph_data.csv')
test.getFeatures()

As we can see from the count of hashtags by post, a large number of Instagram posts contain 27 to 30 hashtags. This makes sense, as marketers and users are attempting to maximise their reach with every piece of content.
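The count of hashtags per post is produced by the repository's getFeatures() method, and how it does so is not shown here. As an independent sketch of the same idea, the snippet below counts hashtags in plain caption strings with a regular expression; the captions and the helper name are hypothetical.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#\w+")


def hashtags_per_post(captions):
    """Return the number of hashtags found in each caption."""
    return [len(HASHTAG_PATTERN.findall(caption)) for caption in captions]


# Hypothetical captions standing in for scraped post text
captions = [
    "Sunday best #fashion #style #ootd",
    "No tags here",
    "Throwback #vintage #style",
]

counts = hashtags_per_post(captions)
distribution = Counter(counts)  # maps hashtag count -> number of posts
print(counts)
print(sorted(distribution.items()))
```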

Now it’s time to build your social graph. I’ve used the betweenness centrality metric as the node size. Betweenness centrality measures how often a node lies on the shortest paths between other nodes, so hashtags that bridge many other hashtags will be larger in size and peripheral hashtags will be smaller in size.

test.selectData()
test.buildGraph(additional_stopwords=['nostalgia'])
test.plotGraph(sizing=100,node_size='betweeness_centrality')

How To Use This:

Hashtags that are closer together represent semantic clusters of hashtags. For example, #Fashion and #Style displayed significant overlap. This can help you understand how people are using hashtags, and it also provides you with an effective method for creating semantically related hashtag lists.
As the node size is controlled by betweenness centrality, we can easily see which nodes are well connected within a specific niche. This allows us to easily spot common themes that are associated with the seed hashtag.
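To see what betweenness centrality measures, here is a small standard-library sketch of Brandes' algorithm for an unweighted, undirected graph. It is an illustration only, not the repository's code, and the toy graph is hypothetical. NetworkX provides the same metric as `nx.betweenness_centrality`, which normalizes the scores by default; this sketch returns raw counts.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness centrality via Brandes' algorithm."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies
        dependency = {node: 0.0 for node in adjacency}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each unordered pair was counted from both ends in an undirected graph
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical toy graph: two triangles joined through a bridge node "x"
edge_list = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
             ("x", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
adjacency = {}
for u, v in edge_list:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

centrality = betweenness(adjacency)
for node in sorted(centrality, key=centrality.get, reverse=True):
    print(node, centrality[node], "degree:", len(adjacency[node]))
```

In this toy graph the bridge node "x" has only two neighbours yet receives the highest score, because every shortest path between the two triangles passes through it.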

Performing Community Detection

You can also perform community detection on the social graph data. Community detection is an unsupervised clustering technique. Its objective function attempts to maximise the links within a group/community whilst minimising the links between communities.

test.plotCommunity()
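One common way to formalise that objective is modularity, which compares the share of links inside each community with the share expected from the nodes' degrees alone. Whether the repository uses modularity is an assumption on my part. The sketch below scores different partitions of a hypothetical toy graph with the standard library only.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    edge_total = len(edges)
    community_of = {node: index
                    for index, group in enumerate(communities)
                    for node in group}
    internal = [0] * len(communities)    # edges with both ends in community i
    degree_sum = [0] * len(communities)  # total degree of nodes in community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[i] / edge_total
               - (degree_sum[i] / (2 * edge_total)) ** 2
               for i in range(len(communities)))


# Hypothetical toy graph: two triangles joined by a single link
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]

natural_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
poor_split = modularity(edges, [{"a", "b"}, {"c", "d", "e", "f"}])
single_group = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])

print(round(natural_split, 4), round(poor_split, 4), round(single_group, 4))
```

The split that keeps each triangle together scores highest. A community detection algorithm searches for a partition like this automatically, and NetworkX ships several such algorithms in `networkx.algorithms.community`.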