
Spam Filtering with Naive Bayes

Data Skeptic

Release Date: 07/27/2018


Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to effectively, and often seamlessly, filter junk email out from good email.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (Spam or \neg Spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".
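As a minimal sketch of what such a labeled dataset looks like in code (the example messages here are invented for illustration; in practice you'd load a standard corpus or user-flagged data):

```python
# A toy labeled spam corpus: each entry is (message text, label).
labeled_corpus = [
    ("You have won a free cruise, claim now", "spam"),
    ("Quarterly report attached for review", "ham"),
    ("Cheap meds, no prescription needed", "spam"),
    ("Can we move our meeting to Thursday?", "ham"),
]

# Hold out a portion for evaluation (a simple deterministic split here;
# a real pipeline would shuffle and stratify).
train_set = labeled_corpus[:3]
test_set = labeled_corpus[3:]
```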

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform well on high-dimensional data, unlike many other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part, we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
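A minimal sketch of this featurization step, turning a document into unigram and bigram count features (the tokenizer and example message are made up for illustration; a real system would do richer normalization):

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real systems would also strip
    # punctuation, handle HTML, etc.
    return text.lower().split()

def bigrams(tokens):
    # Pair each token with its successor: ["win", "a", "prize"]
    # becomes ["win a", "a prize"].
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

doc = "Win a FREE prize now"
tokens = tokenize(doc)

# Each unigram and bigram becomes a feature with its count.
features = Counter(tokens + bigrams(tokens))
```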

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are independent. If x and y are known to be independent, then Pr(x \cap y) = Pr(x) \cdot Pr(y). In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document is. Thus, Pr(\text{algorithm} \cap \text{probability}) > Pr(\text{algorithm}) \cdot Pr(\text{probability}), violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
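To see how multiplying per-word probabilities plays out in practice, here is a toy Naive Bayes classifier with Laplace smoothing. All messages and labels below are invented for illustration, and a real filter would train on a full corpus; note the sums of log probabilities, which are equivalent to multiplying probabilities but avoid numerical underflow on long documents:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    # docs: list of (token_list, label) pairs.
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # how many docs per label
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(tokens, word_counts, label_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_lp = None, float("-inf")
    for label in label_counts:
        # log prior + sum of log likelihoods: the "naive" independence
        # assumption lets us just add per-word log probabilities.
        lp = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # +1 Laplace smoothing so unseen words don't zero everything out.
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

train_docs = [
    ("win free prize now".split(), "spam"),
    ("free money win".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train(train_docs)
print(predict("free prize".split(), *model))  # classifies as "spam"
```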

In the final leg of the discussion, we explore the question of whether a Naive Bayesian Classifier would be a good choice for detecting fake news.