So I got tired of the existing scrapers returning incorrect results for about a third of my movies. It turns out that while the sites we scrape have lots of great data, their search engines range from inaccurate (IMDB's) to completely broken (TMDB's). They both choke when your naming doesn't exactly match the official name of the movie, and they really don't like foreign movies.
Any real search engine can handle these cases just fine. Looking around, I found that Bing has a very nice, easy-to-use developer API for accessing its search results. Google and Yahoo both have APIs too, but they are only for use as part of an AJAX website (Google's FAQ says they'll block you if you scrape their results). The Bing ToU allows "end-user-facing website or application".
Anyway, I edited the existing IMDB scraper to do a Bing search of "site:imdb.com movie (year)" and parse the returned XML. The actual data is still scraped from IMDB; I only changed the search part. For my collection of ~200 movies (Hollywood, anime, foreign, etc.) Bing got the correct IMDB link 100% of the time.
This method could be used for any scraper by replacing "site:imdb.com" with "site:themoviedb.org" or something else.
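To make the idea concrete, here is a rough Python sketch of the two pieces involved: building the "site:..." query and picking the first IMDB title link out of whatever the search returns. This is not the scraper itself (that is an XML file), and it deliberately leaves out the Bing request and XML parsing. The function names and the URL pattern are my own assumptions for illustration.

```python
import re

# Assumed shape of an IMDB title URL, e.g. http://www.imdb.com/title/tt0245429/
IMDB_TITLE_RE = re.compile(r"https?://(?:www\.)?imdb\.com/title/(tt\d+)")


def build_query(title, year=None, site="imdb.com"):
    """Build a site-restricted query such as 'site:imdb.com movie (year)'."""
    query = f"site:{site} {title}"
    if year:
        query += f" ({year})"
    return query


def first_imdb_id(result_urls):
    """Return the IMDB title ID from the first result URL that looks like a
    title page, or None if no result matches."""
    for url in result_urls:
        match = IMDB_TITLE_RE.match(url)
        if match:
            return match.group(1)
    return None
```

Swapping the site is just a matter of passing a different `site` value, e.g. `build_query("Amelie", site="themoviedb.org")`. The list passed to `first_imdb_id` would be the result URLs pulled out of the search response, in ranked order.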
Question for an XBMC admin: Bing's API requires an AppID, just like TMDB's. I signed up for one personally but would rather not release the scraper using my AppID. Would it be possible for someone @xbmc.org to sign up for an official AppID that can be used? It's an online process that takes about 5 minutes.
Once the AppID is squared away I'll release my bing_imdb.xml file and post a little guide on how to add Bing to other scrapers.