kNN Applied to Classifying Male/Female Subreddits

Abstract

By the end of this post, it is the hope that the enthusiastic reader will be able to implement a kNN algorithm in Python to classify images from different subreddits of their choice.

What you're getting into

  • ~10 minute read for the short version, 1hr for the long (nitty gritty) detailed version
  • An example of a K nearest neighbours (knn) classifier
  • How to download thousands of images from subreddits
  • Learning about image processing and data manipulation in Python

Sweet Intentions

I wasn’t really born with a sweet tooth. I just don’t like sweet things, with a few somewhat universal exceptions:

  1. Cheesecake of any kind
  2. Black rice pudding
  3. Sweet Caroline (bop, bop, bahhh)
  4. Eye-candy

That last point (arguably shoehorned in) is probably the most universal of them all. While ear-drums and taste-buds vary from person to person, visually some people are so handsome or pretty that we feel compelled to simply admire their beauty. It is perhaps unsurprising then that the #107 and #265 top subreddits are PG-13 eye-candy based. More surprising are their unique names, /r/gentlemanboners and /r/ladyboners, pushing the limit of oxymoronic compound words.

Perhaps it is their names that have gathered so much intrigue to their little corner of the internet; over 360k and 150k men and women have subscribed to add sprinklings of eye-candy to their Reddit feeds. Clearly these men and women (or at least the majority) are not subscribing simply for the name but for the content provided by the subreddit. It was with this intention that I set out to classify the images of each subreddit and see if I could learn something about the viewing preferences of men and women in relation to viewing the opposite sex along the way.

Data Collection

First we need to collect images from each subreddit. For that we use the absolutely fantastic program RedditImageGrab.  If we want to download the top 1000 pictures from /r/ladyboners we would run the following command in the terminal.
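A minimal example of such a command, assuming RedditImageGrab's redditdownload.py script is available on your path (the exact flag names vary a little between versions, so check the project's README):

    python redditdownload.py ladyboners ./ladyboners --num 1000 --sort-type topall

Swapping ladyboners for gentlemanboners (and the destination folder) gives the matching data set for the other subreddit.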

Data Preparation

To prepare the data for sampling, all images were reduced using PIL, maintaining aspect ratios, to a minimum height or width of 256 pixels, and the 63 most common colours of the resulting picture were selected, as shown below with Alison Brie and Hugh Jackman. Before resizing, the original dimensions of each image were retained; this will turn out to be vastly more important than the colours. However, for pedagogical purposes a complete analysis including colours is used, for two reasons:

  1. kNN extends easily to higher dimensional spaces; while the addition of colours may not vastly improve the result, a small amount is gained from the additional features for little cost.
  2. For other subreddits colour can be used with a relatively high success rate. For example, one can run this program on the /r/earthporn and /r/urbanhell subreddits and achieve a ~78% success rate using colour alone.

Example of images being prepared for the knn algorithm through downsampling

Data Analysis

1000 images were prepared from each subreddit and put into a Pandas dataframe for various manipulations. A kNN model was trained using either the RGB or HSV values, together with the original height and width of each image, as features, and the images' corresponding subreddit as the target, where /r/gentlemanboners and /r/ladyboners were assigned target values of 0 and 1 respectively. More specifically, we train SciPy's implementation of kNN using two thirds of our data set and test on the remaining third. For each image a 63×5 array is assessed, where the 63 rows are the different RGB or HSV values and the last two columns hold the (constant) original height and width. We then take a majority vote over the predicted results of the array to arrive at the final classification of the picture. It is important that the number of colours per image is odd to prevent ties in voting (e.g. if 23 rows say the image is from /r/gentlemanboners and 40 say /r/ladyboners, we predict the latter).

We are able to attain an 84% success rate in classifying images, indicating that there is at least some clustering going on and encouraging us to search for distinguishing features between the two subreddits using the parameters of our model. For readers who are unsure exactly what the kNN algorithm is, here is a great YouTube video on how kNN works.

Data Inference

The RGB and HSV values were plotted to see if there were any clear colour preferences. As mentioned in the preparation section, little correlation between the sexes was found for colour.

Plots of RGB & HSV values to see if any local clustering is apparent for the knn to benefit from

Clearly no clustering of significance is readily apparent here, implying that the majority of the classification power comes from the images’ dimensions. To see this we plot the image dimensions in a 3D plot, setting the z direction to the mean value of the targets. For example, if the resolution (1000×1100) has 8 images from /r/ladyboners and 4 from /r/gentlemanboners, it would be plotted at (1000, 1100, 0.66). A colour map was used to help visualize the differences between the clusters.

3D scatter plot of image dimensions, with the mean target value on the z axis, revealing the clustering the knn benefits from

Wow! Okay, now we’re getting somewhere: there clearly seem to be some trends in the size of images submitted to each subreddit. Let’s quantify these differences a bit better by plotting the respective KDEs for resolutions predominantly held by men or women, where "predominantly" is defined as a mean less than 0.2 for men and greater than 0.8 for women.

KDEs highlighting the clustering which helps the knn to succeed

Mining the Subreddit Rules and Viewer Preferences

An example of an underlying rule found by using knn

And the differences become readily clear. /r/ladyboners clusters at a much lower resolution than /r/gentlemanboners. In fact it seems peculiar and suspicious just how large all the entries for /r/gentlemanboners are; a good scientist takes a step back here and asks if they have made a mistake. An investigation of the subreddit shows we are, in fact, correct, and have actually found an underlying rule separating the two subreddits. Looking at the sidebar of /r/gentlemanboners we see a set of rules which govern the subreddit, one of which is a minimum picture size. No such set of rules exists for /r/ladyboners!
Three other direct features jump out:

  1. A large portion of men submit very high resolution images
  2. People seem to like capping pictures at a height of 3000 pixels
  3. While many pictures on /r/ladyboners scale vertically in the same way as those on /r/gentlemanboners, a stronger trend exists for pictures submitted to /r/ladyboners to also scale horizontally

The first point may be a by-product of Rule IV of /r/gentlemanboners, as shown in the pictures. The second point is a bit more mysterious; I am not sure why 3000 is such a popular vertical resolution, but if you have an idea why this could be, leave a comment below this post. The third point is the most interesting one in my opinion. There seems to be a weak trend for women to submit/upvote pictures with a wider aspect ratio on /r/ladyboners than their male counterparts on /r/gentlemanboners, however pinning down the reason (or proving the difference is even statistically significant) is beyond the scope of this post.
Rose Leslie (Ygritte from Game of Thrones) and a really good looking guy

The Nitty Gritties (kNN and PIL in Python)

Thus far, a general path to quantify the differences between the two subreddits, /r/gentlemanboners and /r/ladyboners, has been laid out. This section tackles the implementation of the code, allowing the reader to reproduce and understand it for their own purposes. We begin assuming all images have been downloaded as laid out in the “Data Collection” section above.

We're going to need a few packages, so let's load those in.
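The post does not reproduce the import cell, so the set below is an assumption pieced together from the libraries referenced throughout this walkthrough (PIL, Pandas, matplotlib, pickle, and a kNN classifier, for which scikit-learn's KNeighborsClassifier is the usual choice):

    import glob
    import pickle
    import random
    import colorsys

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
    from PIL import Image
    from sklearn.neighbors import KNeighborsClassifier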

Defining Functions

Okay great, now that we have the libraries, let's write some functions in advance to use later. The first function is a short code snippet which returns a dataframe with its indexes mixed up; this will be useful later when we wish to randomly split all the data into training and testing sets. The next function relies on the Python PIL library to collect the image dimensions of a photo and then return a down-sampled version of the image.
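The original snippets are not shown here, so the two helpers below are a minimal sketch of what is described above; the names shuffle_dataframe and downsample_image, and the reading of "minimum height or width of 256 pixels" as "scale so the shorter side becomes 256", are my assumptions:

    def shuffle_dataframe(df):
        # Return a copy of the dataframe with its rows (and hence indexes) shuffled.
        return df.sample(frac=1).reset_index(drop=True)

    def downsample_image(path, min_side=256):
        # Record the original dimensions, then shrink the image (preserving its
        # aspect ratio) so that the shorter side is min_side pixels long.
        im = Image.open(path).convert('RGB')
        width, height = im.size
        scale = min_side / float(min(width, height))
        small = im.resize((int(round(width * scale)), int(round(height * scale))),
                          Image.LANCZOS)
        return width, height, small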

Let's take it out for a spin.
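For example (the file names here are hypothetical placeholders for the two images shown below):

    # Hypothetical file names; substitute any image you have downloaded.
    w, h, small = downsample_image('gentlemanboners/alison_brie.jpg')
    print(w, h, small.size)   # original dimensions and the down-sampled size
    small.show()

    w, h, small = downsample_image('ladyboners/hugh_jackman.jpg')
    small.show()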

Down-sampled picture of Alison Brie

And for the ladies in the house…

Down-sampled picture of Hugh Jackman

Splitting the Data

Okay great, so now we can load down-sampled images so our laptops don’t explode like hot potatoes during analysis. The next thing to do is to gather a list of all image paths and split them into training and testing lists.
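A sketch of that step, assuming the images were downloaded into folders named gentlemanboners/ and ladyboners/; the two-thirds/one-third ratio comes from earlier in the post, while the folder layout and the decision to shuffle the path list directly (rather than with shuffle_dataframe) are my choices:

    # Every downloaded image path, tagged with its subreddit's target value
    # (0 for /r/gentlemanboners, 1 for /r/ladyboners).
    paths = ([(p, 0) for p in glob.glob('gentlemanboners/*.jpg')] +
             [(p, 1) for p in glob.glob('ladyboners/*.jpg')])

    random.shuffle(paths)                    # mix the two subreddits together
    split = int(len(paths) * 2 / 3)          # train on two thirds, test on the rest
    train_paths, test_paths = paths[:split], paths[split:]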

Gathering Attributes

With these lists ready we can now extract the colours (and original dimensions) present in the down-sampled images and dump them into some pickles to make our lives easier for reading in later.
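A sketch of that extraction; keeping the 63 most common colours and the original width/height follows the post, while the helper name extract_features and the pickle file names are assumptions:

    def extract_features(path):
        # Down-sample the image, then keep its original dimensions together
        # with the 63 most common RGB colours of the down-sampled version.
        width, height, small = downsample_image(path)
        counts = small.getcolors(maxcolors=small.size[0] * small.size[1])
        top_colours = [rgb for _, rgb in sorted(counts, reverse=True)[:63]]
        return width, height, top_colours

    # A real run would want a try/except here to skip corrupt or non-image downloads.
    for path_list, name in [(train_paths, 'train_features.pkl'),
                            (test_paths, 'test_features.pkl')]:
        features = [(extract_features(p), target) for p, target in path_list]
        with open(name, 'wb') as f:
            pickle.dump(features, f)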

Cleaning the Structure

Currently everything is stored in a pickled list of lists of lists, which is really not the cleanest way to present the data, so we should load it into dataframes. We write a function to read in a pickle and return two data frames, one with RGB values and one with HSV values.
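A minimal sketch of that function, matching the pickle layout used above: one row per colour, with the image's original dimensions and target repeated on every row, plus an image index (my addition) so that votes can later be grouped per picture:

    def features_to_frames(pickle_name):
        # Flatten the pickled features into two dataframes: one row per colour,
        # carrying the image's dimensions, its target and an image index.
        with open(pickle_name, 'rb') as f:
            features = pickle.load(f)

        rgb_rows, hsv_rows = [], []
        for image_id, ((width, height, colours), target) in enumerate(features):
            for r, g, b in colours:
                h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
                rgb_rows.append([r, g, b, width, height, target, image_id])
                hsv_rows.append([h, s, v, width, height, target, image_id])

        rgb_df = pd.DataFrame(rgb_rows,
                              columns=['r', 'g', 'b', 'width', 'height', 'target', 'image'])
        hsv_df = pd.DataFrame(hsv_rows,
                              columns=['h', 's', 'v', 'width', 'height', 'target', 'image'])
        return rgb_df, hsv_df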

Exploratory Data Analysis

Now we can plot the colours to see if there is anything of value in the RGB and HSV info we extracted.
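One way to produce such a plot, purely illustrative rather than the post's exact code: a 3D scatter of the RGB values coloured by subreddit (the HSV version is the same with the other dataframe):

    rgb_df, hsv_df = features_to_frames('train_features.pkl')

    # 3D scatter of the extracted RGB values, coloured by subreddit target.
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(rgb_df['r'], rgb_df['g'], rgb_df['b'],
               c=rgb_df['target'], cmap='coolwarm', s=2, alpha=0.3)
    ax.set_xlabel('R'); ax.set_ylabel('G'); ax.set_zlabel('B')
    plt.show()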

RGB and HSV plots show little clustering for the knn to benefit from

The answer is no. However, for other subreddits this is not necessarily the case. The use of colour alone can classify images between /r/earthporn and /r/urbanhell with a 78% success rate, so if you are going to analyse two of your favourite subreddits, don't necessarily skip this stage.

Training the KNN Model

Now we load in the training and testing dataframes…
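Using the function and pickle names from above, this is simply:

    # Per-colour dataframes for both halves of the data.
    train_rgb, train_hsv = features_to_frames('train_features.pkl')
    test_rgb, test_hsv = features_to_frames('test_features.pkl')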

and then call upon SciPy to train the classifier…
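The post credits SciPy here; the usual fit/predict kNN classifier in the Python scientific stack is scikit-learn's KNeighborsClassifier, so that is what this sketch assumes (the value of k is also an assumption, as the post never states it):

    feature_cols = ['r', 'g', 'b', 'width', 'height']    # the 63x5 array per image

    knn = KNeighborsClassifier(n_neighbors=5)             # k = 5 is an assumption
    knn.fit(train_rgb[feature_cols], train_rgb['target'])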

Run the KNN Model

and run it! HOWEVER, we have a choice here. Either we take an individual vote from each colour and then classify the image based on a majority vote; OR we take all colours from the picture and assess them directly in space, averaging all votes with no rounding. We have done the latter; one can remove a few lines of code and modify the cell below to try the former.
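A sketch of the "average all votes with no rounding" option, grouping the per-colour rows of the test dataframe by image; swapping predict_proba for predict (a hard 0/1 vote per colour, then a majority vote) gives the other option:

    predictions, truths = [], []
    for _, rows in test_rgb.groupby('image'):
        # Fraction of neighbours voting /r/ladyboners for each of the 63 colours,
        # averaged over the whole picture with no intermediate rounding.
        votes = knn.predict_proba(rows[feature_cols])[:, 1]
        predictions.append(int(votes.mean() > 0.5))
        truths.append(rows['target'].iloc[0])

    accuracy = np.mean(np.array(predictions) == np.array(truths))
    print('Success rate: {:.0%}'.format(accuracy))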

and we get something like an 84% success rate on the held-out test set, the figure quoted earlier in this post.

Inspecting Results

Cool, so the knn algorithm seems to give adequate results; now it's time to try to figure out why and reproduce the work from the very beginning of this post. We know colour apparently didn't give us much, so let's plot the image dimensions.
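A simple way to draw that scatter from the dataframes we already have (one point per image, original width against height, coloured by subreddit); shown here for the test set, though the post presumably plots everything:

    # One point per image: original width vs. height, coloured by subreddit.
    images = test_rgb.drop_duplicates('image')
    plt.scatter(images['width'], images['height'],
                c=images['target'], cmap='coolwarm', s=10, alpha=0.5)
    plt.xlabel('width (px)'); plt.ylabel('height (px)')
    plt.show()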

Scatter plot of all image resolutions; no clustering readily apparent, although aspect ratio trends begin to show

Mmmmm, not so clear. Let's try plotting them by their mean value: where on the spectrum of [0, 1] does a given image dimension lie? Is it closer to men's preferences or women's preferences?
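One way to build that plot, grouping images by their exact (width, height) pair and averaging their targets, matching the (1000, 1100, 0.66) example given earlier:

    # Mean target value for every distinct (width, height) resolution.
    images = test_rgb.drop_duplicates('image')
    means = images.groupby(['width', 'height'])['target'].mean().reset_index()

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(means['width'], means['height'], means['target'],
               c=means['target'], cmap='coolwarm')
    ax.set_xlabel('width (px)'); ax.set_ylabel('height (px)')
    ax.set_zlabel('mean target (0 = /r/gentlemanboners, 1 = /r/ladyboners)')
    plt.show()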

Scatter plot to view distribution

Summarizing our Findings

Okay, so there are clearly preferences, but it is still not clear why the knn works so well; let's plot the KDEs to help clear it up a bit more.
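A sketch of those KDEs using the "predominantly" cut-offs defined earlier (mean below 0.2 for men, above 0.8 for women) and pandas' built-in KDE plotting; here the densities are over the resolutions themselves rather than over individual images, which is one of several reasonable readings of the post:

    # Resolutions held predominantly by men (/r/gentlemanboners) or women (/r/ladyboners).
    men = means[means['target'] < 0.2]
    women = means[means['target'] > 0.8]

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, column in zip(axes, ['width', 'height']):
        men[column].plot.kde(ax=ax, label='/r/gentlemanboners')
        women[column].plot.kde(ax=ax, label='/r/ladyboners')
        ax.set_xlabel('{} (px)'.format(column))
        ax.legend()
    plt.show()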

KDEs highlighting the clustering which helps the knn to succeed

Finally the trend shows itself. It is clear that the knn works as well as it does because the large majority of women's images have smaller resolutions than men's. There also seems to be a bit more of a trend for women to prefer, or at least be indifferent to, pictures with a wider aspect ratio. Pretty neat!

What Did You Think?

Enjoy the project? Have a question about the code? Are there any two subreddits you would like to apply the knn algorithm to? Please tweet at me on Twitter, like the Facebook page or leave a comment below. Receiving feedback is one of the most rewarding parts of creating visualizations and sharing projects, and I would truly love to hear what others think! Besides, I'm pretty sure nobody reads these parts of my posts, so leaving a comment would make you part of the super elite very thorough readers club… maybe if I put a gif…
