Using Neural Networks to Identify Hate Speech

Zachary Kitt
2 min read · Dec 1, 2018

Last year, researchers from Cornell published a paper describing their work classifying tweets as hateful, offensive, or neither. They experimented with several models and parameters before settling on logistic regression. On top of building and evaluating models, they engineered features beyond the text itself, including the readability, sentiment, and metadata of each tweet. This was a lot of work, but the results were promising.

I wanted a similar classifier for my own purposes, but I didn’t have the patience to extract the same features. Instead, I decided to use a technique that would require no feature engineering on my end: a convolutional neural network (CNN). With this method, words are represented by vectors (thanks to pre-trained word embeddings), and the network learns for itself which features matter.

I implemented the CNN architecture suggested by Yoon Kim, deviating only in the word embeddings, the number of filters generated at the convolutional layer, and the error metric. The resulting model performed as well as, if not better than, the model used by the Cornell researchers when comparing overall F1 scores on the same data (0.91). This is gratifying, because I didn’t optimize my model’s parameters, let alone extract any features.
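To make the architecture concrete, here is a minimal sketch of the forward pass of a Kim-style CNN in plain Python. It is not my actual model: the embeddings and weights are random stand-ins, the tiny vocabulary and the `FILTERS_PER_SIZE` value are made up for illustration, and bias terms, dropout, and training are left out. Only the window sizes (3, 4, and 5) and the convolve, ReLU, max-over-time pooling, softmax sequence follow Kim's paper.

```python
import math
import random

random.seed(0)

# Toy stand-in for pre-trained embeddings: all values are random and illustrative.
EMBED_DIM = 4
VOCAB = ["<unk>", "the", "lakers", "are", "trash", "right", "now"]
EMBEDDINGS = {w: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for w in VOCAB}

WINDOW_SIZES = (3, 4, 5)  # filter widths from Kim's paper
FILTERS_PER_SIZE = 2      # hypothetical; real models use far more
CLASSES = ("hate", "offensive", "neither")


def make_filters(width):
    """Each filter is a flat weight vector spanning `width` consecutive word vectors."""
    return [[random.uniform(-0.5, 0.5) for _ in range(width * EMBED_DIM)]
            for _ in range(FILTERS_PER_SIZE)]


FILTERS = {w: make_filters(w) for w in WINDOW_SIZES}
N_FEATURES = len(WINDOW_SIZES) * FILTERS_PER_SIZE
# Untrained dense layer mapping pooled features to one logit per class.
DENSE = [[random.uniform(-0.5, 0.5) for _ in range(N_FEATURES)] for _ in CLASSES]


def embed(tokens, min_len):
    """Look up each token's vector, then zero-pad so the widest filter fits once."""
    vectors = [EMBEDDINGS.get(t, EMBEDDINGS["<unk>"]) for t in tokens]
    while len(vectors) < min_len:
        vectors.append([0.0] * EMBED_DIM)
    return vectors


def conv_max_pool(vectors, filt, width):
    """Slide one filter over the tweet, apply ReLU, and keep the largest activation."""
    best = 0.0
    for start in range(len(vectors) - width + 1):
        window = [x for vec in vectors[start:start + width] for x in vec]
        activation = max(0.0, sum(w * x for w, x in zip(filt, window)))
        best = max(best, activation)
    return best


def predict_proba(text):
    """Return a probability per class for one tweet (meaningless until trained)."""
    vectors = embed(text.lower().split(), max(WINDOW_SIZES))
    # One pooled feature per filter, so the feature vector has a fixed length
    # no matter how long the tweet is.
    features = [conv_max_pool(vectors, f, w) for w in WINDOW_SIZES for f in FILTERS[w]]
    logits = [sum(w * x for w, x in zip(row, features)) for row in DENSE]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return dict(zip(CLASSES, (e / total for e in exps)))


print(predict_proba("Lakers are trash right now"))
```

The point of the sketch is the pooling step: because each filter contributes only its strongest activation, the network decides which word windows matter, and I never have to hand it readability or sentiment scores.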

However, at the level of individual classes, my model was worse at distinguishing hate speech from offensive speech. This is an important distinction to make, as it may be the difference between what is legal and illegal in some localities. Contradictions within the source data may be responsible for this confusion. “Tired of hoes man” is labeled as offensive, while “some lying ass hoes lol” is labeled as hate speech. Although both are offensive and hateful, I can’t say why they receive separate designations. This problem arises for many racial and gender-based slurs.
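A strong overall F1 and a weak hate-speech class can coexist when hate speech is a small share of the data. The sketch below shows this with per-class precision, recall, and F1 computed from scratch. The labels are invented for illustration and are not my results or the Cornell dataset, and treating the overall score as a support-weighted average of the per-class F1 scores is an assumption on my part.

```python
from collections import Counter

LABELS = ("hate", "offensive", "neither")


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, actual)
    return report


def weighted_f1(report):
    """Average the per-class F1 scores, weighting each by its class's support."""
    total = sum(support for *_, support in report.values())
    return sum(f1 * support for _, _, f1, support in report.values()) / total


# Invented toy labels: 10 hate, 70 offensive, 20 neither.
y_true = ["hate"] * 10 + ["offensive"] * 70 + ["neither"] * 20
y_pred = (
    ["hate"] * 3 + ["offensive"] * 7        # most hate tweets read as offensive
    + ["offensive"] * 68 + ["hate"] * 2
    + ["neither"] * 19 + ["offensive"] * 1
)

report = per_class_f1(y_true, y_pred)
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:10s} P={precision:.2f} R={recall:.2f} F1={f1:.2f} n={support}")
print(f"weighted F1 = {weighted_f1(report):.2f}")
```

With these made-up labels, the weighted F1 lands near 0.89 while the hate class scores only 0.40, which is why the overall number alone can hide the confusion between hateful and offensive tweets.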

The previous examples suggest that annotators were not working with clearly defined categories, but the problems extend beyond that. A tweet stating that the “Lakers are trash right now” was labeled as hate speech. I’m willing to give that person a pass, even though they have a questionable opinion (go Lakers!). The point is, if I don’t know what the annotators were thinking, I can’t expect my model to.

So what about future work? Right now, I have a reasonably accurate (90%) classifier of offensive tweets that I can run against individual accounts or topics. Alternatively, I could try to create a better dataset. This seems hard, because people have different opinions about what hate speech actually is, but I do think that a good first step in that direction would be building a classifier that identifies genocidal or violent messages. This is both a clearer definition of hate and a more immediate concern in an era of mass shootings and populist leanings.

My code is available here.
