
Spam Filtering with Naive Bayes

Data Skeptic

Release Date: 07/27/2018

Today's spam filters are advanced data-driven tools. They rely on a variety of techniques to effectively and often seamlessly filter out junk email from good email.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (Spam or \neg Spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform quite well on high dimensional data, unlike many other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part, we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
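To make the feature engineering concrete, here's a minimal sketch of turning a document into n-gram count features. The function name and the toy message are illustrative, not from the episode; real systems would add tokenization and normalization tricks on top of this.

```python
from collections import Counter

def extract_features(document, n=1):
    """Turn a document into a bag-of-n-grams feature dictionary.

    Each n-gram maps to its count in the document. With n=1 this is a
    plain bag of words; with n=2 it yields bigram features.
    """
    tokens = document.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

doc = "win free money now"
print(extract_features(doc, n=1))
print(extract_features(doc, n=2))
```

With `n=1` this yields one feature per word; with `n=2` the features are adjacent word pairs like `"win free"` and `"free money"`, which is the bigram trick discussed below.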

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are independent. If x and y are known to be independent, then Pr(x \cap y) = Pr(x) \cdot Pr(y). In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document. Thus, Pr(\text{algorithm} \cap \text{probability}) > Pr(\text{algorithm}) \cdot Pr(\text{probability}), violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
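The "just multiply the probabilities" step can be sketched end to end. This is a toy implementation, not the filter any particular vendor uses: the training documents are invented, and it sums log-probabilities (equivalent to multiplying probabilities, but numerically safer) with add-one smoothing so unseen words don't zero out a class.

```python
import math
from collections import Counter

def train(docs, labels):
    """Estimate log priors and per-class word log-probabilities
    with Laplace (add-one) smoothing."""
    word_counts = {c: Counter() for c in set(labels)}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        model[c] = {
            "prior": math.log(class_counts[c] / len(labels)),
            "cond": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                     for w in vocab},
            "unseen": math.log(1 / (total + len(vocab))),
        }
    return model

def classify(model, doc):
    """Pick the class maximizing log Pr(class) + sum of log Pr(word | class),
    i.e. the naive product of per-word probabilities."""
    def score(c):
        params = model[c]
        return params["prior"] + sum(
            params["cond"].get(w, params["unseen"])
            for w in doc.lower().split())
    return max(model, key=score)

docs = ["win free money now", "free prize claim now",
        "meeting agenda attached", "lunch tomorrow maybe"]
labels = ["spam", "spam", "ham", "ham"]
model = train(docs, labels)
print(classify(model, "claim your free money"))  # spam
```

Note that the `cond` table treats each word independently, which is exactly the naive assumption under discussion; swapping word tokens for bigram tokens would recover some of the lost correlation.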

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.