NLP + Twitter = Discovering Bin Laden’s Death before CNN

Also, why text digestion is so much better than text generation.

A newsletter exploring businesses and strategies that use cutting edge language models to solve new problems and become more efficient.



According to Microsoft Research, “Language is the holy grail of Artificial Intelligence”. There’s been an explosion in Natural Language Processing (NLP) tech companies recently…

and most of them are focused on the wrong thing.

You see, ever since GPT-3 came out, everyone has been focused on leveraging text generation. Yet, the last thing we need is to add more junk to the internet. In fact, we need the opposite.

Instead of text generation, we should be focusing on text summarization.

90% of important communication and/or work is done via the written word (where did I get that number? don't ask). The efficiency of our communication is constrained by factors like the quality of someone else's writing, our reading speed, how tired we are, our own writing quality, and so on. We're bottlenecked by how quickly we can read and absorb crappy writing.

If we could figure out ways to use NLP to make the reading process easier, it could save millions of people LOADS of time. People pay for that stuff.

The technology is just getting good enough. It's time to find ways to use it.

I quickly became obsessed with this idea, and it sent me down a rabbit hole. Thankfully, I managed to snag a few things along the way. First, I'll walk through some examples of how these ideas are being applied right now. Then, I'll discuss some opportunities I see in this area that aren't being taken advantage of yet.

When Twitter knows about Bin Laden’s death before CNN

The first example is Dataminr, a data analytics company valued at $1.6 billion.

Basically, these guys ingest huge amounts of publicly available information from the internet (esp. Twitter) and look for evidence of high-impact events or risks as they start to happen.

In other words, if shit’s going down, Dataminr is one of the first people to know (except maybe Snowden’s boys).

Example? Well, they first got attention back in 2011, when they flagged Osama bin Laden's death 23 minutes before major news networks confirmed it…

A lot can happen in 23 minutes if you know what to do.

It shouldn't come as a surprise that the finance industry is one of their biggest customers. Having access to the juice before other people gives you a pretty straightforward advantage on Wall Street (high-frequency trading, anyone?).

Dataminr's strategy is to be a high-tech filter: plow through all the crud on the internet and try to find a few nuggets of gold before someone else scoops them up.
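Dataminr's actual pipeline is proprietary, so take this with a grain of salt, but the core "filter" idea can be sketched as a keyword spike detector over time windows of tweets. Everything below (function name, thresholds) is my own toy illustration, not anything Dataminr has published:

```python
from collections import Counter

def spike_alerts(windows, threshold=5.0, min_count=3):
    """Flag terms whose frequency in the newest window jumps well above
    their average frequency in earlier windows.

    `windows` is a list of token lists, one per time window (oldest first).
    A toy stand-in for Dataminr-style event detection: a term that was
    quiet yesterday and is everywhere right now is probably news.
    """
    *history, latest = windows
    baseline = Counter()
    for w in history:
        baseline.update(w)
    n = max(len(history), 1)

    alerts = []
    for term, count in Counter(latest).items():
        avg = baseline[term] / n
        # Require both an absolute floor and a big jump over baseline.
        if count >= min_count and count > threshold * max(avg, 0.1):
            alerts.append(term)
    return alerts
```

A real system would layer source credibility, geolocation, and deduplication on top, but the "find the spike before everyone else" skeleton is roughly this simple.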

Useful, but I think we can get a bit more creative than that.


DO NOT HIT THIS BUTTON

Share

(Every share encourages me to make more crappy blog posts and procrastinate on getting a job. So for my mother’s sake, don’t share this post!!)


Papers… papers everywhere

I just came across a new book called "Practical Natural Language Processing" by Vajjala et al. It's basically a hands-on guide for using NLP in the real world, from actual model design to building an engineering team and serving people at scale.

What grabbed my attention was its discussion of NLP in the healthcare industry. These days, patient medical information is stored in Electronic Health Records (EHRs). The move from paper to electronic records led to an explosion in patient information, and it's causing some issues.

Doctors are busy people. They're overworked and constantly moving from patient to patient. The doc doesn't have time to sit down and re-read about all your problems. EHRs are text-heavy and dense, so digesting them takes too much time and mental effort. This causes friction in the hospital and can even lead to improper treatment.

This is where NLP can save us loads of time, and even lives.

The book discusses a tool called HARVEST, created at Columbia University. It takes the unstructured information in EHRs and summarizes it into an intuitive overview of a patient's history. It's an excellent tool: in a user study, over 75% of participants said they would definitely use HARVEST again in the future.


I don't need to get into the details of how the product works; it's the idea that matters. This issue of information abundance is universal. When people began moving to electronic formats, they started storing more information not because they needed it, but because they could. More information is better, right? Well, not really.

There are so many areas where summarization can reduce friction and make people's lives easier. Having more information isn't necessarily bad, as long as we have efficient ways of processing and interpreting it.
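The simplest mechanical version of this is extractive summarization: score every sentence and keep only the most informative ones. Here's a toy frequency-based sketch (to be clear, this is not how HARVEST works; its actual method isn't described here):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Score each sentence by how frequent its words are across the whole
    document, then keep the top-scoring sentences in their original order.

    The intuition: sentences full of the document's most-repeated words
    tend to carry its main topic.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return ' '.join(s for s in sentences if s in top)
```

Real systems use far better sentence scoring (and abstractive models rewrite rather than extract), but even this crude version turns a wall of text into something a tired doctor could skim.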

“Well when you say it that way...”

The last example is not a company per se, but it's a glimpse of what could be a great idea: a GPT-3-powered website called tl;dr papers.

Academic papers are notorious for being jargon-filled word soup (gotta make sure it sounds legit, ya feel?). This website takes your incomprehensible paper abstract and translates it into something a second grader could understand.

To try it out, I ran it on a paper I published a few years ago, titled “GPU Accelerated Non-Parametric Background Subtraction” (try saying that ten times fast).

Abstract (please don’t try to actually read this nonsense):

Accurate background subtraction is an essential tool for high level computer vision applications. However, as research continues to increase the accuracy of background subtraction algorithms, computational efficiency has often suffered as a result of increased complexity. Consequentially, many sophisticated algorithms are unable to maintain real-time speeds with increasingly high resolution video inputs. To combat this unfortunate reality, we propose to exploit the inherently parallelizable nature of background subtraction algorithms by making use of NVIDIA’s parallel computing platform known as CUDA. By using the CUDA interface to execute parallel tasks in the Graphics Processing Unit (GPU), we are able to achieve up to a two orders of magnitude speed up over traditional techniques. Moreover, the proposed GPU algorithm achieves over 8x speed over its CPU-based background subtraction implementation proposed in our previous work [1].

TL;DR output:

Background subtraction is a tool that helps to see what is in a picture, and it is very hard to do. If we can find a way to do it faster, we can use it for lots of other things.

Are you trying to say I’m stupid??

Anyway, it's a neat website, but the idea itself has huge implications. Learning to be a better writer is largely learning to write more clearly. People don't like reading difficult text.

Hell, we hire lawyers because legal documents suck so much to read. I wonder if we can use GPT-3 to translate contracts too…

Oh wait, we can!

Here’s GPT-3’s 2nd grader translation of a standard NDA:

This means that if you tell anyone else something I told you in private, I can sue you and make you pay me lots of money. It also means that if I tell you something in private, I can sue you if you tell anyone else. It also means that if I tell you something in private and then I change my mind and want to tell everyone else about it, I can sue you if you tell anyone else first.
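Sites like this don't publish their prompts, but the usual GPT-3 pattern is instruction + passage + completion cue. A hypothetical sketch of building such a prompt (the function name and wording are my own, not tl;dr papers'):

```python
def second_grader_prompt(passage):
    """Build a GPT-3-style prompt asking for a plain-language rewrite.

    The model is a text *completion* engine, so the trick is to end the
    prompt right where you want the rewrite to begin.
    """
    return (
        "Rewrite the following text so a second grader could understand it.\n\n"
        f"Text: {passage}\n\n"
        "Second-grader version:"
    )
```

You'd send the returned string to a text-completion endpoint, and whatever the model generates after the cue is your "translation." Swap the instruction line and the same skeleton turns NDAs into plain English too.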

Text summarization is just one of the many ways the legal industry is going to be disrupted in the next few years with NLP. Soon enough, we’ll have robots for lawyers (cough, cough, donotpay.com, cough).

But we'll explore more of that topic later. So if you want to read more, I'm sorry, but you'll just have to subscribe. There's no other option :/

Now, let's explore some of the opportunities that people haven't taken advantage of yet.

Leads

As I said before, I think the market for text digestion is much larger than the market for text generation. Right now, everyone's focused on using GPT-3 to generate text, which means the text digestion space has much less hype and many more opportunities.

Now, the $$ strategy here is to stay away from general use cases (like the second-grader translator above). They're much harder and take longer to monetize.

Instead, I think it’s better to focus on niche areas. Focus on one problem that a particular industry has. If you can solve that problem, you’ll have no issue finding customers. Your utility is clear to them.

Some ideas

1. Target document-driven industries, like healthcare or government.

If you’ve ever worked in one of these spaces, you know the headache documents can cause. If you find a way to interact with these documents more efficiently, it can save people loads of time and energy.

HARVEST is already doing this for the healthcare industry. I’m certain there’s many other places where this could be useful.

2. Focus on people whose time is worth a lot of money.

If you save them time, you save them money.

Executives often have employees just to create document briefs for them so they don’t need to waste time digging up information themselves. If you can find a better way to condense information for busy people like corporate executives, you could make a killing.

3. Curation

This one is a bit more general, but it builds on an already popular trend on the internet. Because the internet is filled with so much junk, curating content for people is very valuable. Typically this is done by a domain expert, or by someone who spends loads of time manually sifting through junk to find those bits of gold. NLP could be extremely useful here, similar to what Dataminr is doing, but for content.

If you can gather difficult-to-find information that others find interesting, people will want it.

How does GPT-3 fit into this?

GPT-3 is a text generation model, but as you saw above, it can be used for summarization and translation as well.

GPT-3 could work quite well for many of these applications, but it's not strictly necessary, or even recommended, for all of them. For instance, GPT-3 can't summarize a large document in one shot because its prompt window is limited to 2,048 tokens.

If your goal is to process huge amounts of information, you would probably be better off using a different model.
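That said, if you do want to stretch GPT-3 over a longer document, the usual workaround for the window limit is map-reduce style: split the text into chunks that fit, summarize each chunk, then summarize the combined summaries. A rough sketch, with the actual model call left as a pluggable `summarize` function and the chunk budget crudely approximated by word count rather than real tokens:

```python
import re

def chunk_text(text, max_words=1500):
    """Greedily pack whole sentences into chunks under a rough word budget
    (a crude stand-in for the model's token limit)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

def summarize_long(text, summarize, max_words=1500):
    """Map-reduce summarization: summarize each chunk, then summarize the
    concatenated chunk summaries. `summarize` is any callable that can
    condense a string that already fits in the model's window."""
    parts = [summarize(c) for c in chunk_text(text, max_words)]
    combined = ' '.join(parts)
    return summarize(combined) if len(parts) > 1 else combined
```

The second pass can lose detail, which is exactly why a purpose-built summarization model is often the better tool for bulk document processing.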

However, if you’re just translating a small block of text into something more understandable, then GPT-3 is a great option.

If you’re interested in a more technical deep-dive into different summarization models, let me know!

Also if you have any ideas of niche use-cases, “@” me on Twitter (@liamport9) or send me a DM. Many of the best ideas will rely on industry experience, so I’m very curious what y’all have to say.


(Apologies for the delay on this post. I was on vacation last week in Connecticut eating lobster rolls and grilled muffins. I will be continuing weekly posts)