Finding social communities can help you to easily identify sub-topical trends within a niche.
As businesses continue to use social media, we can harness the power of graph theory to better understand niches, communities and groups of individuals ❤
In this guide you’ll use an Instagram Scraper to create a mini social graph database via NetworkX.
By monitoring the co-occurrences of hashtags, we can create a graph containing nodes and edges (links) which will then be clustered via an unsupervised machine learning technique called community detection.
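To make the co-occurrence idea concrete, here is a minimal sketch using only the standard library. The captions are made-up placeholders rather than scraped data, and the helper names `extract_hashtags` and `cooccurrence_edges` are hypothetical; they are not part of the repository's notebooks.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical captions standing in for scraped Instagram posts.
captions = [
    "New look today #fashion #style #ootd",
    "Fresh cut #hair #style",
    "Weekend outfit #fashion #ootd",
]


def extract_hashtags(caption):
    """Return the sorted, de-duplicated, lower-cased hashtags in a caption."""
    return sorted({tag.lower() for tag in re.findall(r"#(\w+)", caption)})


def cooccurrence_edges(captions):
    """Count how many posts each unordered hashtag pair appears in together."""
    edges = Counter()
    for caption in captions:
        tags = extract_hashtags(caption)
        # Every pair of hashtags inside the same post is one co-occurrence.
        edges.update(combinations(tags, 2))
    return edges


edges = cooccurrence_edges(captions)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Each hashtag becomes a node and each counted pair becomes a weighted edge. The pairs could then be loaded into a NetworkX graph with `G.add_edge(a, b, weight=weight)`.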
- You will need to have a working version of Google Chrome.
- You will need to have installed Anaconda & Jupyter Notebook.
Then clone a version of this repository which contains the InstagramScraper and InstagramGraph Jupyter Notebook files.
git clone https://github.com/kitsamho/Instagram_Scraper_Graph
Install ChromeDriver For Selenium
Now you’ll need to download the version of ChromeDriver that corresponds to your browser from the ChromeDriver downloads page. It’s worth noting that the ChromeDriver version must match your Google Chrome version.
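If you want to sanity-check the two versions before running the scraper, a small comparison like the one below may help. It is only a sketch: it assumes that "matching" means the major, minor and build numbers agree, and the version strings are hypothetical examples. The helper names are illustrative and not part of the repository.

```python
def version_prefix(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.strip().split(".")[:parts])


def versions_match(chrome_version, driver_version, parts=3):
    """True when both version strings share the same leading components."""
    return version_prefix(chrome_version, parts) == version_prefix(driver_version, parts)


# Hypothetical version strings, e.g. copied from chrome://version and from
# the output of `chromedriver --version`.
chrome = "114.0.5735.198"
driver = "114.0.5735.90"

print(versions_match(chrome, driver))
print(versions_match(chrome, "113.0.5672.63"))
```

Pass `parts=1` if you only want to compare major versions.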
Installing All Of The Other Python Packages
Assuming that you’ve installed Anaconda, your missing dependencies will be:
pip install selenium
pip install regex
pip install emoji
pip install seaborn
pip install spacy
pip install langdetect
pip install plotly
python -m spacy download en_core_web_lg
The re, urllib and getpass modules ship with Python’s standard library, so they do not need to be installed with pip.
Change Your Selenium ChromeDriver Files
After downloading the relevant ChromeDriver file, place it within the cloned repository inside of a folder called Chromedrivers.
Then open up the InstagramScraper.ipynb notebook inside of a Jupyter Notebook session. You can run a Jupyter Notebook server with the following command on terminal / command prompt:
jupyter notebook
Additionally we will have to update the Python class InstagramScraper(). Make sure to update the driver_loc location to ‘/chromedrivers/chromedriver’
class InstagramScraper():
    def __init__(self, driver_loc='/chromedrivers/chromedriver'):
Run The IG Crawler With The InstagramScraper() Class
As described inside of the README.md file:
This class is made up of a series of methods that allow for the scraping of Instagram post data. The pipeline consists of three main methods that need to be called sequentially. There is no current method to chain the whole pipeline.
self.logIn() : user detail capture, WebDriver initialisation, Instagram log in.
self.getLinks() : gets n unique links containing <#HASHTAG> using WebDriver.
self.getData() : implements multi-threaded scraping of data from self.getLinks using a combination of Selenium WebDriver and Beautiful Soup. Method returns a pandas DataFrame.
Here’s an example of how to run the scraper:
# Initialise the scraper
scraper = InstagramScraper()
# Log in to Instagram
scraper.logIn()
# Collect post links
scraper.getLinks()
# Get data
df = scraper.getData()
# This is the social media graph data that we will use for the network analysis
df.to_csv('social_media_graph_data.csv')
Run The IG Graph Creation Code via The InstagramGraph() Class
After web scraping a specific hashtag on Instagram, you can use the following code to inspect your data via Plotly.
self.getFeatures(translate=False): creates various descriptive metrics from the data.
self.selectData(english=True,remove_verified=True,max_posts=3,lemma=True): subsets the data across various variables.
self.buildGraph(additional_stopwords=,min_frequency=5): generates edges and nodes and adds them to an instance of a NetworkX graph object.
self.plotCommunity(colorscale=False): creates a sunburst plot of communities and contributing hashtags
self.savePlot(plot='map'): saves plots as HTML to local directory. Use ‘community’ if community sunburst plot needs to be saved
self.saveTables(): saves all csv files to local directory – node DataFrame, edge DataFrame, initial processed DataFrame and selected DataFrame
Here’s an example of how to use the code:
test = InstagramGraph('social_media_graph_data.csv')
test.getFeatures()
As we can see from the count of hashtags by post, a large share of Instagram posts contain 27 to 30 hashtags. This makes sense, as Instagram caps each post at 30 hashtags and marketers/users are attempting to maximise their reach with every piece of content.
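A hashtags-per-post distribution like this can be approximated in a few lines. The sketch below uses made-up captions and a hypothetical `hashtag_count` helper. It does not reproduce what getFeatures() does internally, which the source does not describe.

```python
import re
from collections import Counter

# Hypothetical captions standing in for the scraped post text.
captions = [
    "Sunset walk #travel #sunset",
    "Studio day #hair #haircut #hairstyle",
    "No tags today",
    "City break #travel #city",
]


def hashtag_count(caption):
    """Count the hashtags in a single caption."""
    return len(re.findall(r"#\w+", caption))


counts = [hashtag_count(caption) for caption in captions]

# Map "number of hashtags" to "number of posts with that many hashtags".
distribution = Counter(counts)
for n_tags, n_posts in sorted(distribution.items()):
    print(f"{n_tags} hashtags: {n_posts} post(s)")
```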
Now it’s time to build your social graph. I’ve used the betweenness centrality metric as the node size. Betweenness centrality measures how often a node sits on the shortest paths between other nodes, so hashtags that bridge many parts of the graph will be drawn larger and peripheral hashtags will be drawn smaller.
test.selectData()
test.buildGraph(additional_stopwords=['nostalgia'])
test.plotGraph(sizing=100,node_size='betweeness_centrality')
How To Use This:
- Hashtags that are closer together represent semantic clusters of hashtags. For example, #Fashion and #Style displayed significant overlap. This can help you understand how people are using hashtags, and it gives you an effective method for creating semantically related hashtag lists.
- As the node size is controlled by betweenness centrality, we can easily see which nodes are well connected within a specific niche. This allows us to spot common themes that are associated with the seed hashtag.
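To see how betweenness centrality translates into node sizes, here is a small sketch on a hypothetical hashtag graph. It calls NetworkX directly rather than the InstagramGraph() class, and the size scaling (`10 + 100 * score`) is an arbitrary choice for illustration, not the one used by plotGraph().

```python
import networkx as nx

# Hypothetical graph: two tight hashtag groups joined only through "style".
G = nx.Graph()
G.add_edges_from([
    ("fashion", "ootd"), ("fashion", "outfit"), ("ootd", "outfit"),
    ("fashion", "style"), ("style", "hair"),
    ("hair", "haircut"), ("hair", "hairstyle"), ("haircut", "hairstyle"),
])

# Fraction of shortest paths passing through each node (normalised by default).
centrality = nx.betweenness_centrality(G)
degree = dict(G.degree())

# Scale centrality into marker sizes; the base and multiplier are arbitrary.
node_sizes = {node: 10 + 100 * score for node, score in centrality.items()}

for node in sorted(centrality, key=centrality.get, reverse=True):
    print(f"{node:10s} degree={degree[node]} betweenness={centrality[node]:.2f}")
```

In this toy graph "style" has only two neighbours, yet it gets the highest score because every shortest path between the two groups runs through it. This is why a large node signals a bridging hashtag rather than simply a frequently used one.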
Performing Community Detection
You can also perform community detection on the social graph data. Community detection is an unsupervised clustering technique: its objective function attempts to maximise the links within a group/community whilst minimising the links between communities.
How To Use This:
- Community detection provides you with several groups, helping you to analyse the data better. For example, community 0 appears to be focused around hair, haircut and hairstyle in contrast to community 2 which focuses on fashion, style and makeup.
- You can create audience personas from community detection. Alternatively, you can use it to see closely related nodes and differentiate between types of behaviours / people / activities within a specific niche.
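The idea of maximising links within communities while minimising links between them can be tried out on a toy graph. The source does not say which algorithm the InstagramGraph() class uses, so this sketch swaps in NetworkX's greedy modularity maximisation as one possible stand-in. The hashtags and edges are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    modularity,
)

# Hypothetical hashtag graph: a hair group and a fashion group joined by
# a single edge between "style" and "hair".
G = nx.Graph()
G.add_edges_from([
    ("hair", "haircut"), ("hair", "hairstyle"), ("haircut", "hairstyle"),
    ("fashion", "style"), ("fashion", "makeup"), ("style", "makeup"),
    ("style", "hair"),
])

# Greedily merge groups of nodes while the modularity score keeps improving.
communities = [set(c) for c in greedy_modularity_communities(G)]

for members in communities:
    print(sorted(members))

# Modularity compares within-community links to what a random graph with the
# same node degrees would give; higher values mean clearer separation.
print(round(modularity(G, communities), 3))
```

On real scraped data the communities will be larger and noisier, and different algorithms may split the same graph differently, so treat the groups as a starting point for interpretation.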