Data Skeptic
Machine learning, AI, and data science explored through interviews with experts, explainer episodes, and a broad survey of how technology is changing our world.
info_outline
Designing Recommender Systems for Digital Humanities
11/23/2025
Designing Recommender Systems for Digital Humanities
In this episode of Data Skeptic, we explore the fascinating intersection of recommender systems and digital humanities with guest Florian Atzenhofer-Baumgartner, a PhD student at Graz University of Technology. Florian is working on , Europe's largest online collection of historical charters, containing millions of medieval and early modern documents from across the continent. The conversation delves into why traditional recommender systems fall short in the digital humanities space, where users range from expert historians and genealogists to art historians and linguists, each with unique research needs and information-seeking behaviors. Florian explains the technical challenges of building a recommender system for cultural heritage materials, including dealing with sparse user-item interaction matrices, the cold start problem, and the need for multi-modal similarity approaches that can handle text, images, metadata, and historical context. The platform leverages various embedding techniques and gives users control over weighting different modalities—whether they're searching based on text similarity, visual imagery, or diplomatic features like issuers and receivers. A key insight from Florian's research is the importance of balancing serendipity with utility, collection representation to prevent bias, and system explainability while maintaining effectiveness. The discussion also touches on unique evaluation challenges in non-commercial recommendation contexts, including Florian's "research funnel" framework that considers discovery, interaction, integration, and impact stages. Looking ahead, Florian envisions recommendation systems becoming standard tools for exploration across digital archives and cultural heritage repositories throughout Europe, potentially transforming how researchers discover and engage with historical materials. The new version of , set to launch with enhanced semantic search and recommendation features, represents an important step toward making cultural heritage more accessible and discoverable for everyone.
/episode/index/show/dataskeptic/id/39139150
info_outline
DataRec Library for Reproducible in Recommend Systems
11/13/2025
DataRec Library for Reproducible in Recommend Systems
In this episode of Data Skeptic's Recommender Systems series, host Kyle Polich explores DataRec, a new Python library designed to bring reproducibility and standardization to recommender systems research. Guest Alberto Carlo Maria Mancino, a postdoc researcher from Politecnico di Bari, Italy, discusses the challenges of dataset management in recommendation research—from version control issues to preprocessing inconsistencies—and how DataRec provides automated downloads, checksum verification, and standardized filtering strategies for popular datasets like MovieLens, Last.fm, and Amazon reviews. The conversation covers Alberto's research journey through knowledge graphs, graph-based recommenders, privacy considerations, and recommendation novelty. He explains why small modifications in datasets can significantly impact research outcomes, the importance of offline evaluation, and DataRec's vision as a lightweight library that integrates with existing frameworks rather than replacing them. Whether you're benchmarking new algorithms or exploring recommendation techniques, this episode offers practical insights into one of the most critical yet overlooked aspects of reproducible ML research.
/episode/index/show/dataskeptic/id/39027820
info_outline
Shilling Attacks on Recommender Systems
11/05/2025
Shilling Attacks on Recommender Systems
In this episode of Data Skeptic's Recommender Systems series, Kyle sits down with Aditya Chichani, a senior machine learning engineer at Walmart, to explore the darker side of recommendation algorithms. The conversation centers on shilling attacks—a form of manipulation where malicious actors create multiple fake profiles to game recommender systems, either to promote specific items or sabotage competitors. Aditya, who researched these attacks during his undergraduate studies at SPIT before completing his master's in computer science with a data science specialization at UC Berkeley, explains how these vulnerabilities emerge particularly in collaborative filtering systems. From promoting a friend's ska band on Spotify to inflating product ratings on e-commerce platforms, shilling attacks represent a significant threat in an industry where approximately 4% of reviews are fake, translating to $800 billion in annual sales in the US alone. The discussion delves deep into collaborative filtering, explaining both user-user and item-item approaches that create similarity matrices to predict user preferences. However, these systems face various shilling attacks of increasing sophistication: random attacks use minimal information with average ratings, while segmented attacks strategically target popular items (like Taylor Swift albums) to build credibility before promoting target items. Bandwagon attacks focus on highly popular items to connect with genuine users, and average attacks leverage item rating knowledge to appear authentic. User-user collaborative filtering proves particularly vulnerable, requiring as few as 500 fake profiles to impact recommendations, while item-item filtering demands significantly more resources. Aditya addresses detection through machine learning techniques that analyze behavioral patterns using methods like PCA to identify profiles with unusually high correlation and suspicious rating consistency. However, this remains an evolving challenge as attackers adapt strategies, now using large language models to generate more authentic-seeming fake reviews. His research with the MovieLens dataset tested detection algorithms against synthetic attacks, highlighting how these concerns extend to modern e-commerce systems. While companies rarely share attack and detection data publicly to avoid giving attackers advantages, academic research continues advancing both offensive and defensive strategies in recommender systems security.
/episode/index/show/dataskeptic/id/38923335
info_outline
Music Playlist Recommendations
10/29/2025
Music Playlist Recommendations
In this episode, Rebecca Salganik, a PhD student at the University of Rochester with a background in vocal performance and composition, discusses her research on fairness in music recommendation systems. She explores three key types of fairness—group, individual, and counterfactual—and examines how algorithms create challenges like popularity bias (favoring mainstream content) and multi-interest bias (underserving users with diverse tastes). Rebecca introduces LARP, her multi-stage multimodal framework for playlist continuation that uses contrastive learning to align text and audio representations, learn song relationships, and create playlist-level embeddings to address the cold start problem. A significant contribution of Rebecca's work is the Music Semantics dataset, created by scraping Reddit discussions to capture how people naturally describe music using atmospheric qualities, contextual comparisons, and situational associations rather than just technical features. This dataset, available on Hugging Face, enables more nuanced recommendation systems that better understand user preferences and support niche tastes. Her research utilizes industry datasets including and Spotify's Million Playlist Dataset, and points toward exciting future applications in music generation and multimodal systems that combine audio, text, and video.
/episode/index/show/dataskeptic/id/38821500
info_outline
Bypassing the Popularity Bias
10/15/2025
Bypassing the Popularity Bias
/episode/index/show/dataskeptic/id/38592885
info_outline
Sustainable Recommender Systems for Tourism
10/09/2025
Sustainable Recommender Systems for Tourism
In this episode, we speak with Ashmi Banerjee, a doctoral candidate at the Technical University of Munich, about her pioneering research on AI-powered recommender systems in tourism. Ashmi illuminates how these systems can address exposure bias while promoting more sustainable tourism practices through innovative approaches to data acquisition and algorithm design. Key highlights include leveraging large language models for synthetic data generation, developing recommendation architectures that balance user satisfaction with environmental concerns, and creating frameworks that distribute tourism more equitably across destinations. Ashmi's insights offer valuable perspectives for both AI researchers and tourism industry professionals seeking to implement more responsible recommendation technologies.
/episode/index/show/dataskeptic/id/38520995
info_outline
Interpretable Real Estate Recommendations
09/22/2025
Interpretable Real Estate Recommendations
In this episode of Data Skeptic's Recommender Systems series, host Kyle Polich interviews Dr. Kunal Mukherjee, a postdoctoral research associate at Virginia Tech, about the paper "Z-REx: Human-Interpretable GNN Explanations for Real Estate Recommendations" The discussion explores how the post-COVID real estate landscape has created a need for better recommendation systems that can introduce home buyers to emerging neighborhoods they might not know about. Dr. Mukherjee, explains how his team developed a graph neural network approach that not only recommends properties but provides human-interpretable explanations for why certain regions are suggested. The conversation covers the advantages of using graph-based models over traditional recommendation systems, the importance of regional context in real estate features, and how co-click data from similar users can create more effective recommendations. Key topics include the distinction between model developer explanations and end-user explanations, the challenges of feature perturbation in recommendation systems, and how graph neural networks can discover novel pathways to emerging real estate markets that traditional models might miss.
/episode/index/show/dataskeptic/id/38317000
info_outline
Why Am I Seeing This?
09/08/2025
Why Am I Seeing This?
In this episode of Data Skeptic, we explore the challenges of studying social media recommender systems when exposure data isn't accessible. Our guests Sabrina Guidotti, Gregor Donabauer, and Dimitri Ognibene introduce their innovative "recommender neutral user model" for inferring the influence of opaque algorithms.
/episode/index/show/dataskeptic/id/38134315
info_outline
Eco-aware GNN Recommenders
08/30/2025
Eco-aware GNN Recommenders
In this episode of Data Skeptic, we dive into eco-friendly AI with Antonio Purificato, a PhD student from Sapienza University of Rome. Antonio discusses his research on "EcoAware Graph Neural Networks for Sustainable Recommendations" and explores how we can measure and reduce the environmental impact of recommender systems without sacrificing performance.
/episode/index/show/dataskeptic/id/38030325
info_outline
Networks and Recommender Systems
08/17/2025
Networks and Recommender Systems
Kyle reveals the next season's topic will be "Recommender Systems". Asaf shares insights on how network science contributes to the recommender system field.
/episode/index/show/dataskeptic/id/37855470
info_outline
Network of Past Guests Collaborations
07/21/2025
Network of Past Guests Collaborations
Kyle and Asaf discuss a project in which we link former guests of the podcast based on their co-authorship of academic papers.
/episode/index/show/dataskeptic/id/37497525
info_outline
The Network Diversion Problem
07/06/2025
The Network Diversion Problem
In this episode, Professor Pål Grønås Drange from the University of Bergen, introduces the field of Parameterized Complexity - a powerful framework for tackling hard computational problems by focusing on specific structural aspects of the input. This framework allows researchers to solve NP-complete problems more efficiently when certain parameters, like the structure of the graph, are "well-behaved". At the center of the discussion is the network diversion problem, where the goal isn’t to block all routes between two points in a network, but to force flow - such as traffic, electricity, or data - through a specific path. While this problem appears deceptively similar to the classic "Min.Cut/Max.Flow" algorithm, it turns out to be much harder and, in general, its complexity is still unknown. Parameterized complexity plays a key role here by offering ways to make the problem tractable under constraints like low treewidth or planarity, which often exist in real-world networks like road systems or utility grids. Listeners will learn how vulnerability measures help identify weak points in networks, such as geopolitical infrastructure (e.g., gas pipelines like Nord Stream). Follow out guest:
/episode/index/show/dataskeptic/id/37300825
info_outline
Complex Dynamic in Networks
06/28/2025
Complex Dynamic in Networks
In this episode, we learn why simply analyzing the structure of a network is not enough, and how the dynamics - the actual mechanisms of interaction between components - can drastically change how information or influence spreads. Our guest, Professor Baruch Barzel of Bar-Ilan University, is a leading researcher in network dynamics and complex systems ranging from biology to infrastructure and beyond. Paper in focus:
/episode/index/show/dataskeptic/id/37203460
info_outline
Github Network Analysis
06/22/2025
Github Network Analysis
In this episode we'll discuss how to use Github data as a network to extract insights about teamwork. Our guest, Gabriel Ramirez, manager of the notifications team at GitHub, will show how to apply network analysis to better understand and improve collaboration within his engineering team by analyzing GitHub metadata - such as pull requests, issues, and discussions - as a bipartite graph of people and projects. Some insights we'll discuss are how network centrality measures (like eigenvector and betweenness centrality) reveal organizational dynamics, how vacation patterns influence team connectivity, and how decentralizing communication hubs can foster healthier collaboration. Gabriel’s open-source project, GH Graph Explorer, enables other managers and engineers to extract, visualize, and analyze their own GitHub activity using tools like Python, Neo4j, Gephi and LLMs for insight generation, but always remember – don't take the results on face value. Instead, use the results to guide your qualitative investigation.
/episode/index/show/dataskeptic/id/37101640
info_outline
Networks and Complexity
06/14/2025
Networks and Complexity
In this episode, Kyle does an overview of the intersection of graph theory and computational complexity theory. In complexity theory, we are about the runtime of an algorithm based on its input size. For many graph problems, the interesting questions we want to ask take longer and longer to answer! This episode provides the fundamental vocabulary and signposts along the path of exploring the intersection of graph theory and computational complexity theory.
/episode/index/show/dataskeptic/id/37008275
info_outline
Graphs for Causal AI
05/24/2025
Graphs for Causal AI
How to build artificial intelligence systems that understand cause and effect, moving beyond simple correlations? As we all know, correlation is not causation. "Spurious correlations" can show, for example, how rising ice cream sales might statistically link to more drownings, not because one causes the other, but due to an unobserved common cause like warm weather. Our guest, Utkarshani Jaimini, a researcher from the University of South Carolina's Artificial Intelligence Institute, tries to tackle this problem by using knowledge graphs that incorporate domain expertise. Knowledge graphs (structured representations of information) are combined with neural networks in the field of neurosymbolic AI to represent and reason about complex relationships. This involves creating causal ontologies, incorporating the "weight" or strength of causal relationships and hyperrelations. This field has many practical applications such as for AI explainability, healthcare and autonomous driving. Follow our guest Papers in focus
/episode/index/show/dataskeptic/id/36698640
info_outline
Unveiling Graph Datasets
05/08/2025
Unveiling Graph Datasets
/episode/index/show/dataskeptic/id/36501740
info_outline
Network Manipulation
04/30/2025
Network Manipulation
In this episode we talk with Manita Pote, a PhD student at Indiana University Bloomington, specializing in online trust and safety, with a focus on detecting coordinated manipulation campaigns on social media. Key insights include how coordinated reply attacks target influential figures like journalists and politicians, how machine learning models can detect these inauthentic campaigns using structural and behavioral features, and how deletion patterns reveal efforts to evade moderation or manipulate engagement metrics. Follow our guest Papers in focus
/episode/index/show/dataskeptic/id/36366205
info_outline
The Small World Hypothesis
04/21/2025
The Small World Hypothesis
Kyle discusses the history and proof for the small world hypothesis.
/episode/index/show/dataskeptic/id/36241045
info_outline
Thinking in Networks
04/12/2025
Thinking in Networks
Kyle asks Asaf questions about the new network science course he is now teaching. The conversation delves into topics such as contact tracing, tools for analyzing networks, example use cases, and the importance of thinking in networks.
/episode/index/show/dataskeptic/id/36110015
info_outline
Fraud Networks
04/01/2025
Fraud Networks
In this episode we talk with Bavo DC Campo, a data scientist and statistician, who shares his expertise on the intersection of actuarial science, fraud detection, and social network analytics. Together we will learn how to use graphs to fight against insurance fraud by uncovering hidden connections between fraudulent claims and bad actors. Key insights include how social network analytics can detect fraud rings by mapping relationships between policyholders, claims, and service providers, and how the BiRank algorithm, inspired by Google’s PageRank, helps rank suspicious claims based on network structure. Bavo will also present his iFraud simulator that can be used to model fraudulent networks for detection training purposes. Do you have a question about fraud detection? Bavo says he will gladly help. Feel free to contact him. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35967785
info_outline
Criminal Networks
03/17/2025
Criminal Networks
In this episode we talk with Justin Wang Ngai Yeung, a PhD candidate at the Network Science Institute at Northeastern University in London, who explores how network science helps uncover criminal networks. Justin is also a member of the organizing committee of the satellite conference dealing with criminal networks at the network science conference in The Netherlands in June 2025. Listeners will learn how graph-based models assist law enforcement in analyzing missing data, identifying key figures in criminal organizations, and improving intervention strategies. Key insights include the challenges of incomplete and inaccurate data in criminal network analysis, how law enforcement agencies use network dismantling techniques to disrupt organized crime, and the role of machine learning in predicting hidden connections within illicit networks. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35707915
info_outline
Graph Bugs
03/10/2025
Graph Bugs
In this episode today’s guest is Celine Wüst, a master’s student at ETH Zurich specializing in secure and reliable systems, shares her work on automated software testing for graph databases. Celine shows how fuzzing—the process of automatically generating complex queries—helps uncover hidden bugs in graph database management systems like Neo4j, FalconDB, and Apache AGE. Key insights include how state-aware query generation can detect critical issues like buffer overflows and crashes, the challenges of debugging complex database behaviors, and the importance of security-focused software testing. We'll also find out which Graph DB company offers swag for finding bugs in its software and get Celine's advice about which graph DB to use. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35562065
info_outline
Organizational Network Analysis
03/03/2025
Organizational Network Analysis
In this episode, Gabriel Petrescu, an organizational network analyst, discusses how network science can provide deep insights into organizational structures using OrgXO, a tool that maps companies as networks rather than rigid hierarchies. Listeners will learn how analyzing workplace collaboration networks can reveal hidden influencers, organizational bottlenecks, and engagement levels, offering a data-driven approach to improving effectiveness and resilience. Key insights include how companies can identify overburdened employees, address silos between departments, and detect vulnerabilities where too few individuals hold critical knowledge. Real-life applications range from mergers and acquisitions, where network analysis helps assess company dynamics before an acquisition, to restructuring efforts that improve workflow and team collaboration. Gabriel’s work highlights how organizations can shift from traditional hierarchical thinking to a network-based perspective, leading to smarter decision-making and more adaptable companies.
/episode/index/show/dataskeptic/id/35497205
info_outline
Organizational Networks
02/25/2025
Organizational Networks
Is it better to have your work team fully connected or sparsely connected? In this episode we'll try to answer this question and more with our guest Hiroki Sayama, a SUNY Distinguished Professor and director of the Center for Complex Systems at Binghamton University. Hiroki delves into the applications of network science in organizational structures and innovation dynamics by showing his recent work of extracting network structures from organizational charts to enable insights into decision-making and performance, He'll also cover how network connectivity impacts team creativity and innovation. Key insights include how the structure of organizational networks—such as the depth of hierarchy or proximity to leadership—can influence corporate performance and how sparse network connectivity fosters more diverse and innovative ideas than fully connected networks.
/episode/index/show/dataskeptic/id/35418970
info_outline
Networks of the Mind
02/18/2025
Networks of the Mind
A man goes into a bar… This is the beginning of a riddle that our guest, Yoed Kennet, an assistant professor at the Technion's Faculty of Data and Decision Sciences, uses to measure creativity in subjects. In our talk, Yoed speaks about how to combine cognitive science and network science to explore the complexities and decode the mysteries of the human mind. The listeners will learn how network science provides tools to map and analyze human memory, revealing how problem-solving and creativity emerge from changes in semantic memory structures. Key insights include the role of memory restructuring during moments of insight, the connection between semantic networks and creative thinking, and how understanding these processes can improve problem-solving and analogical reasoning. Real-life applications span enhancing creativity in the workplace, building tools to combat cognitive rigidity in aging, and improving learning strategies by fostering richer, more flexible mental networks. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35340825
info_outline
LLMs and Graphs Synergy
02/10/2025
LLMs and Graphs Synergy
In this episode, Garima Agrawal, a senior researcher and AI consultant, brings her years of experience in data science and artificial intelligence. Listeners will learn about the evolving role of knowledge graphs in augmenting large language models (LLMs) for domain-specific tasks and how these tools can mitigate issues like hallucination in AI systems. Key insights include how LLMs can leverage knowledge graphs to improve accuracy by integrating domain expertise, reducing hallucinations, and enabling better reasoning. Real-life applications discussed range from enhancing customer support systems with efficient FAQ retrieval to creating smarter AI-driven decision-making pipelines. Garima’s work highlights how blending static knowledge representation with dynamic AI models can lead to cost-effective, scalable, and human-centered AI solutions. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35224550
info_outline
A Network of Networks
02/04/2025
A Network of Networks
In this episode, Bnaya Gross, a Fulbright postdoctoral fellow at the Center for Complex Network Research at Northwestern University, explores the transformative applications of network science in fields ranging from infrastructure to medicine, by studying the interactions between networks ("a network of networks"). Listeners will learn how interdependent networks provide a framework for understanding cascading failures, such as power outages, and how these insights transfer to physical systems like superconducting materials and biological networks. Key takeaways include understanding how dependencies between networks can amplify vulnerabilities, applying these principles to create resilient infrastructure systems, and using network medicine to uncover relationships between diseases, potential drug repurposing and the process of aging. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35140740
info_outline
Auditing LLMs and Twitter
01/29/2025
Auditing LLMs and Twitter
Our guests, Erwan Le Merrer and Gilles Tredan, are long-time collaborators in graph theory and distributed systems. They share their expertise on applying graph-based approaches to understanding both large language model (LLM) hallucinations and shadow banning on social media platforms. In this episode, listeners will learn how graph structures and metrics can reveal patterns in algorithmic behavior and platform moderation practices. Key insights include the use of graph theory to evaluate LLM outputs, uncovering patterns in hallucinated graphs that might hint at the underlying structure and training data of the models, and applying epidemic models to analyze the uneven spread of shadow banning on Twitter. ------------------------------- Want to listen ad-free? Try our Graphs Course? Join Data Skeptic+ for $5 / month of $50 / year
/episode/index/show/dataskeptic/id/35071170