
Ep 6: Extracting Data from Old Documents with Rosa Lin, Founder, Tolstoy

Building Things with Machine Learning

Release Date: 10/24/2023

Ep 5: Discovering Pharmaceuticals with Machine Learning, with Ryan Emerson of A-Alpha Bio

Building Things with Machine Learning

A true “aha” conversation! Learn how deep learning techniques from natural language processing (NLP) are applied to drug discovery, specifically to protein-protein interactions. Includes a quick-and-dirty primer on just enough biology to understand the training data A-Alpha Bio uses for its ML models. For more episodes, visit .  Show Notes: 0:37 - The basics of synthetic biology for machine learning practitioners 0:50 - What are proteins and why do they matter? 1:50 - A protein is a string of 20 amino acids… which means it starts looking like a Natural Language Processing...

Ep 4: ROI from ML at "Reasonable Scale" E-Commerce Companies with Ciro Greco

Building Things with Machine Learning

Ciro Greco has built ML systems used at many name-brand retailers. In this episode, he gives us tips on getting value out of ML at “reasonable scale” companies with NLP and information retrieval. He returned often to the concept of “reasonable scale,” and he clearly has a nuanced understanding of that segment and how it differs from the hyperscale tech giants. He also brings advanced ideas from NLP, like embeddings, to e-commerce personalization.  For more episodes, visit .   Show Notes:  1:36: Key differences in applying ML at “reasonable scale”...

Ep 3: Applying ML to Cybersecurity, with Yihua Liao

Building Things with Machine Learning

Yihua Liao is Head of Data Science at Netskope, a next-generation cybersecurity firm. Yihua talks about using both CV and NLP to create novel cybersecurity features. Yihua Liao’s PhD research was on security and machine learning, and he previously worked at Microsoft, Facebook, Uber, and his own startup. For more information about this podcast, visit .   Show Notes:  00:24 - How Netskope addresses cybersecurity. 00:57 - Netskope’s unique approach to cybersecurity through network traffic routing. 02:51 - The prior approach to cybersecurity: a focus on the physical perimeter and...

Ep 2: Ted Mann @ CollX

Building Things with Machine Learning

Ted tells us about applying machine learning to the field of baseball cards! 33% of Americans have trading cards, making this a very large addressable market. Learn some tips on scrappy ways to launch an app, and how similarity search powers one of the killer features of the CollX app.  Key Moments:  Building an application that works around the potential errors of an ML model (15:10). The data and ML behind his trading card valuation model, especially when recent transactions don’t exist (18:30). Dealing with the latency inherent in ML and networking through the concept of...

Ep 1: Tom Rikert @ Masterful AI

Building Things with Machine Learning

In this episode, I interview my colleague Tom Rikert at Masterful AI. Tom is building the "AutoML 2.0" platform for computer vision. We talk about the product for the first 10 minutes, and then spend some time learning about his work at MIT CSAIL, which got him into robotics and computer vision, as well as his experiences selling a startup to Google and his time as a venture capitalist at Andreessen Horowitz and Nextworld Capital.  To learn more about this podcast, visit .  For a video version of this episode, visit .

Trailer

Building Things with Machine Learning

Welcome to the Building Things with Machine Learning Podcast. Every episode, I’ll be interviewing someone who is building really interesting products using machine learning. Our focus is really on applications: medical diagnostics; autonomous vehicles and advanced driver assistance systems (ADAS); geospatial analytics; media and content analysis; manufacturing; logistics; and AEC (architecture, engineering, and construction). What you won’t get are coding tips or research papers. Although ML developers are definitely part of our audience, so are product managers and marketers and...


Rosa Lin is the founder of Tolstoy (www.tolstoy.ai), which specializes in extracting data from documents. As I learned, this is a much tougher problem than traditional OCR! It requires a combination of deep learning and classic CV methods. Rosa also talks about her fascinating background as a journalist and her experience going through Y Combinator.

For more about this podcast, visit www.yaoshiang.com/podcast.html

For the video version including visual examples of Tolstoy's work, visit https://www.youtube.com/watch?v=QtHEXvcGGRs&t=9s

00:26: The problems Tolstoy solves: extracting data from documents like emails, news articles, forms, and handwritten notes and then running NLP algorithms to classify and summarize.

02:54: Typical customers: tech startups, news organizations, utilities, energy companies, legal firms, and educational institutions.

05:05: First walk-through of a use case: Digitizing articles for The Wall Street Journal (with images showing why off-the-shelf OCR failed).

07:19: Specifics of why OCR fails: multiple articles on a single page, columns, images, headings, and handwriting.

09:18: Training a custom model to deal with columns, with visuals showing how Tolstoy works better than Google Cloud Vision. 

11:30: A classic computer vision algorithm for identifying paragraphs.

12:30: Transfer learning with modern Convolutional Neural Networks to identify images vs text.

13:38: Second walk-through of a use case: a classification task for a utility company to help find lead pipes. 

15:20: Can you spot the handwritten word “lead”? 

17:50: Tips for building products around inevitably imprecise ML models. 

19:37: Rosa’s personal journey from biology and journalism to entrepreneurship and ML.

22:49: Seeing the promise of AI in 2015 while at the World Bank and starting an AI hobbyist club.

26:25: How training in journalism translated to the skills required for entrepreneurship.

28:40: Rosa’s experience with Y Combinator (YC W17).
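
For listeners curious what a classic, non-deep-learning approach to the paragraph problem discussed at 11:30 can look like, here is a minimal Python sketch of one well-known technique, the horizontal projection profile. The show notes do not say which algorithm Tolstoy actually uses, so treat this purely as an illustration under that assumption. The names `split_blocks`, `min_gap`, and `demo_page` are hypothetical, and the page is assumed to be already binarized (1 = ink, 0 = background).

```python
def split_blocks(page, min_gap=2):
    """Split a binarized page into vertical text blocks.

    `page` is a list of rows, each a list of 0/1 pixels (1 = ink).
    Returns (top, bottom) row ranges with `bottom` exclusive. Two blocks
    are separated when at least `min_gap` consecutive rows are blank.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in page]

    blocks = []
    start = None     # first inked row of the block being built
    last_ink = None  # most recent row that contained ink
    for y, ink in enumerate(profile):
        if ink == 0:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank gap is wide enough to count as a block break.
            blocks.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


# Toy page: two text lines one blank row apart (same block),
# then a three-row gap, then a second block.
demo_page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
print(split_blocks(demo_page))  # [(1, 5), (8, 10)]
```

A real newspaper scan would also need column detection (the same idea applied to vertical profiles), deskewing, and noise handling, which is part of why the episode describes combining classic CV with trained models.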