GPT-4 Vision is Really Good at Web Scraping

April 12, 2024·2 min read

[include: LLM is function that takes anything and gives back rainbows]

[maybe start with phillip story? meeting friend at restaurant, talking about current LLM developments, say it was only 1 year ago]

What if I told you that there is a database with information about everything on the planet? People's latest thoughts, recently launched products, [fill-out]

Not only that: what if I also told you that millions of people write to this database every millisecond of every day?

This database is the world wide web.

diagram

All IP addresses, aka The Internet
  All names in DNS with at least one unique A record
    DNS names that resolve to servers
      Servers that return HTML

Let's say I want to write a program that, for whatever reason, needs to know kj

Getting data out of a SQL database looks like:

[fill-out]

Getting data out of the web might look like:

[fill-out]

Web scraping: the art of turning the world wide web into your own very slow, quirky database that really doesn't want to be queried. But it contains all of the data on the planet.

HTML is the table schema and XPath selectors are the column names.
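To make the analogy concrete, here is a minimal sketch that pulls the same two rows out of a SQL table and out of an HTML page. Everything in it is made up for illustration: the `products` table, the `PAGE` markup, and the `scrape_products` helper are assumptions, not taken from any real site. It uses only the Python standard library, whose `xml.etree.ElementTree` supports a limited subset of XPath and only parses well-formed markup; real pages usually need a more forgiving parser.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data: the same two products stored in a SQL table and in an HTML page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Kettle", 24.5), ("Toaster", 31.0)],
)

# Querying the database: the schema tells us the column names up front.
sql_rows = conn.execute(
    "SELECT name, price FROM products ORDER BY name"
).fetchall()

# Hypothetical, well-formed page markup (an assumption for this sketch).
PAGE = """
<html><body>
  <div class="product"><h2>Kettle</h2><span class="price">24.50</span></div>
  <div class="product"><h2>Toaster</h2><span class="price">31.00</span></div>
</body></html>
"""


def scrape_products(html: str) -> list[tuple[str, float]]:
    """Extract (name, price) pairs, with XPath-style selectors acting as column names."""
    root = ET.fromstring(html)
    rows = []
    # Each matching <div> plays the role of one table row.
    for product in root.findall(".//div[@class='product']"):
        name = product.findtext("h2")
        price = float(product.findtext("span[@class='price']"))
        rows.append((name, price))
    return sorted(rows)


web_rows = scrape_products(PAGE)
```

Both `sql_rows` and `web_rows` end up holding the same pairs, but only the SQL version had a declared schema to rely on; the selectors in `scrape_products` had to be worked out by reading the page.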

Every table is queried in a different way, and the way one queries the same table can also change with time.
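A small sketch of that fragility, under the same caveats: `PAGE_V1` and `PAGE_V2` are invented versions of one hypothetical page before and after a redesign, and `price_v1` and `price_v2` are illustrative helper names. The selector written for the first layout silently finds nothing on the second.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page, before and after a redesign.
PAGE_V1 = '<html><body><span class="price">24.50</span></body></html>'
PAGE_V2 = '<html><body><div class="cost"><b>24.50</b></div></body></html>'


def price_v1(html: str):
    """Selector written against the original layout."""
    return ET.fromstring(html).findtext(".//span[@class='price']")


def price_v2(html: str):
    """Selector rewritten after the redesign."""
    return ET.fromstring(html).findtext(".//div[@class='cost']/b")


old_on_old = price_v1(PAGE_V1)  # matches the original markup
old_on_new = price_v1(PAGE_V2)  # no match: findtext returns None
new_on_new = price_v2(PAGE_V2)  # matches again once the selector is updated
```

Nothing raises an error when the old selector stops matching, which is part of what makes scrapers tedious to maintain: the "query" keeps running and simply returns nothing.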

https://blog.bonner.is/using-ai-to-find-fencing-courses-in-london/