Speech recognition, NLP, and Podcasts
How can AI help us summarize digital content for learning?
Project code repository: lexcast-pipeline
As a podcast nerd and audiophile, I consume large amounts of digital content. My only real constraint is time, so I want to find the best parts of my favorite podcasts without spending four hours listening to each one. I often find myself watching shorter clips (5-20 minutes) on YouTube rather than the four-hour-long Joe Rogan episodes.
While listening to my all-time favorite, Lex Fridman's The Artificial Intelligence podcast, I thought about the podcast editor who records all these episodes and later picks out the important clips for enthusiasts like me. The editor applies human intelligence: going through the full podcast, identifying the best moments, and editing them into clips.
Can I develop scalable software that can do this for potentially thousands of podcasts every week?
To develop an MVP of the idea, I chose YouTube as my podcast platform because of its rich range of APIs for extracting data. We will simplify the analysis to recognizing names, places, historical events, and important concepts. The podcast is rich in these named entities, since the scientific conversations revolve around such topics. Grasping general concepts like "philosophy, AI, and climate change" is a much more difficult problem.
Data Engineering pipeline
Data Engineering is the process of planning, designing, retrieving, analyzing, and storing large amounts of data. Data engineers, best known for building ETL pipelines, are an integral part of any AI company. To implement my idea, I needed an ETL (Extract, Transform, Load) pipeline to collect the podcasts, analyze their content, and store the data for future analysis or consumption by production software.
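As a rough outline, the pipeline boils down to three small functions chained together; the function and file names below are placeholders for illustration, not the exact ones used in the repository.

import json

def extract(playlist_file):
    # Extract: read the playlist metadata and fetch transcripts (next section)
    with open(playlist_file) as f:
        return json.load(f)

def transform(data):
    # Transform: run NLP over each transcript (described further below)
    return data

def load(data, out_path):
    # Load: persist the structured results as JSON
    with open(out_path, 'w') as f:
        json.dump(data, f, indent=2)

load(transform(extract('urls.json')), 'results.json')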
Speech-to-Text: Extract
Initially, I wrote a Python script that takes a YouTube playlist as input and parses each video in the playlist. For a machine to process the podcast's content with NLP, the content needs to be in a form the machine can easily interpret. I therefore discarded the audio-video content and reduced my source data to transcripts. This is usually done with speech-to-text machine learning models, but for YouTube a third-party solution already exists in the form of the youtube-transcript-api package.
Its GitHub repository documents a useful range of methods for getting the transcripts or subtitles of a video (including auto-generated subtitles). It should be noted that auto-generated subtitles lose some of the true content, as they are only approximations of the words being spoken.
import json
from youtube_transcript_api import YouTubeTranscriptApi

# urls.json maps each episode to its YouTube video id
with open('urls.json') as f1:
    d1 = json.load(f1)

for t in d1:
    video_id = d1[t]['videoId']
    try:
        d1[t]['transcript'] = YouTubeTranscriptApi.get_transcript(video_id)
    except Exception:
        print('Could not fetch a transcript (captions may be disabled) for', video_id)
This completes the Extract phase in the ETL pipeline.
Natural Language Processing: Transform
In this phase, we apply pre-existing natural language processing models, namely spaCy, to recognize named entities and noun chunks in the mass of transcript text for each episode. spaCy's Named Entity Recognition (NER) model lets us recognize important entities mentioned in the text. No customization was done to the model, but a whole range of improvements is possible to enhance the recognition task.
The NER model recognizes a wide range of named entities; however, some of them are irrelevant to a listener because their usage depends on the context of the larger conversation. Entities with the following labels are filtered out.
filter_transcript = ['TIME', 'CARDINAL', 'ORDINAL', 'QUANTITY', 'MONEY', 'PERCENT']
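A minimal sketch of this step, assuming the transcript segments produced by the Extract phase and the small English model en_core_web_sm; the extract_entities helper is illustrative rather than the repository's exact function.

import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(transcript):
    # transcript: the list of {'text', 'start', 'duration'} segments from the Extract phase
    full_text = ' '.join(seg['text'] for seg in transcript)
    doc = nlp(full_text)
    # keep only the entities whose labels are not in the filter_transcript list above
    return [(ent.text.lower(), ent.label_)
            for ent in doc.ents
            if ent.label_ not in filter_transcript]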
In addition to the named entity recognition, I apply a basic filter to the noun chunks of the entire transcript to collect important noun chunks (proper nouns, nouns, and adjectives) that are representative of ideas and broad concepts.
pos_tag = ['PROPN','NOUN','ADJ']
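The noun-chunk filter can be sketched in the same way, keeping only the proper nouns, nouns, and adjectives inside each chunk (again, the helper name is illustrative).

def extract_nouns(transcript):
    doc = nlp(' '.join(seg['text'] for seg in transcript))
    words = []
    for chunk in doc.noun_chunks:
        # keep only the tokens whose part-of-speech tag is in pos_tag above
        words += [tok.text.lower() for tok in chunk if tok.pos_ in pos_tag]
    return ' '.join(words)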
Finally, an extra Python function finds each extracted entity's timestamp in the video, so the data consumer can use it to seek through the video as required. A small example of the final result:
{{"san francisco": "GPE", "timestamp": "01:34:10"},{"russian": "NORP", "timestamp": "00:11:05"},"nouns": " conversation dimitri cto waymo autonomous driving company google self car project dimitri waymo autonomous vehicle space scale accessible autonomous vehicles passengers safety driver driver seat incredible accomplishment engineering difficult exciting artificial intelligence challenges 21st century quick mention sponsor thoughts episode trial labs company businesses machine learning real world problems app summaries books online therapy”}
This concludes the Transform phase.
Store the data in JSON: Load
The formatted results are initially held in dictionaries after the Transform phase. However, with a continuous flow of data and the computational load of inference, the data has to be loaded into a database or a format that can be easily distributed. In this scenario, I chose to export it to JSON to serve my future purpose of building a web application that summarizes podcast playlists. The resulting data from the pipeline can be served through an API or further analyzed in a variety of ways.
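The export itself is only a couple of lines; here results stands for the dictionary of per-episode entities, nouns, and timestamps built in the Transform phase, and the output filename is just an example.

import json

# persist the structured results for the web application or further analysis
with open('playlist_summary.json', 'w') as f:
    json.dump(results, f, indent=2)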
I am personally interested in plotting the geospatial named entities on a mapping service to enhance the learning experience of the podcast listener. Further, a bag-of-words model over the concept nouns extracted from each podcast can be used to summarize the content of each episode. The timestamps for each named entity can be used to create an amazing podcast experience, rich with features that enhance our listening, reading, and learning experience.
The proof-of-concept data set and the pipeline code are documented at the GitHub repository.