2008-07-26, 15:41
I have recently started using XBMC (on a Mac) and found that while the IMDB scraper works well enough, there are many DVDs not on IMDB that are on Amazon.
[Note: Although the examples below use the film title "Soylent Green", I have manually searched IMDB in a browser and confirmed that other titles are definitely not listed.]
Surprisingly, there is no existing Amazon scraper. As part of an effort to make one myself, I started by looking at the existing scrapers to see how they work. Following on from this, I made some initial efforts to convert the current FilmAffinity scraper to return English results rather than Spanish ones (you can download a copy here if you are interested: http://homepage.mac.com/jelockwood/.Publ...nityen.zip).
While I have not yet got an Amazon scraper even partially working, I have found some important information about the format of the various URLs that Amazon uses.
1. Amazon itself normally replaces spaces in title searches with a plus (+) symbol; however, a literal space (or %20) also seems to work.
All of the following search URLs work when entered in a web browser:
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent+green&x=0&y=0
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent green&x=0&y=0
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green&x=0&y=0
as does this slightly shorter form:
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green
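As a rough sketch of how the search URL could be assembled, here is a small Python helper built on the shorter form above. The function name build_search_url and the constant SEARCH_BASE are made up for illustration; an actual XBMC scraper would do this in its CreateSearchUrl section rather than in Python.

```python
from urllib.parse import quote_plus

# Base of the shorter search URL form (without the x=0&y=0 parameters).
SEARCH_BASE = "http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords="


def build_search_url(title: str) -> str:
    """Return an Amazon DVD search URL for the given film title.

    quote_plus() encodes spaces as '+', which is the form Amazon itself
    uses, and percent-encodes characters such as '&' that would
    otherwise break the query string.
    """
    return SEARCH_BASE + quote_plus(title)


print(build_search_url("soylent green"))
```

Using the plus form rather than %20 simply mirrors what Amazon generates itself; either should work according to the tests above.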
2. The URL of a search result normally has a rather messy and complicated format, like this:
Code:
http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1
As you can see, there appear to be two different ID numbers plus a text field. However, I have found that the following much simpler form of the URL also works:
Code:
http://www.amazon.com/dp/B0016I0AJG/
Therefore we just need to extract the ID number beginning with a B (they all seem to begin with a B).
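The extraction step could be sketched in Python along these lines. The pattern assumes the ID always follows "/dp/", starts with a B, and is ten characters long; the length is a guess based on the single example above, and the names extract_product_id and short_product_url are hypothetical.

```python
import re
from typing import Optional

# Assumption: the ID follows "/dp/", starts with 'B', and is ten
# characters of digits and capital letters in total.
PRODUCT_ID_PATTERN = re.compile(r"/dp/(B[0-9A-Z]{9})(?:[/?]|$)")


def extract_product_id(url: str) -> Optional[str]:
    """Return the 'B...' ID found in a result URL, or None if absent."""
    match = PRODUCT_ID_PATTERN.search(url)
    return match.group(1) if match else None


def short_product_url(product_id: str) -> str:
    """Build the simpler /dp/ form of the product URL."""
    return f"http://www.amazon.com/dp/{product_id}/"


long_url = (
    "http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/"
    "ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1"
)
product_id = extract_product_id(long_url)
if product_id is not None:
    print(short_product_url(product_id))
```

The same regular expression should carry over to the GetSearchResults section of a scraper more or less unchanged, if the assumptions about the ID hold for other titles.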
3. The thumbnail image normally has a URL of the form:
Code:
http://ecx.images-amazon.com/images/I/51bU-puSlkL._SL500_AA240_.jpg
and the large image has a URL of the form:
Code:
http://ecx.images-amazon.com/images/I/51bU-puSlkL._SS500_.jpg
As you can see, the ID number here is totally different from anything used previously. However, I have also found that the following URL produces the same large image and uses the main ID number from the original URL:
Code:
http://ecx.images-amazon.com/images/P/B0016I0AJG.01.L.jpg
or, using the older alternative host name:
Code:
http://images.amazon.com/images/P/B0016I0AJG.01.L.jpg
Note that these forms of the URL must use a P rather than an I in the path.
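Put together, the large image URL can be derived from the main ID alone. A minimal Python sketch follows, with the hypothetical name large_image_url; the ".01.L" suffix is copied from the example above and I have not checked what other values it accepts.

```python
def large_image_url(product_id: str, host: str = "ecx.images-amazon.com") -> str:
    """Build the large cover image URL from the main 'B...' product ID.

    The path uses 'P' (not 'I'), and the '.01.L' suffix is taken as-is
    from the working example URL.
    """
    return f"http://{host}/images/P/{product_id}.01.L.jpg"


print(large_image_url("B0016I0AJG"))
# The older alternative host name can be passed explicitly.
print(large_image_url("B0016I0AJG", host="images.amazon.com"))
```

This avoids having to scrape the separate image ID from the product page at all.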
Based on all of the above, would anyone care to help by producing an initial scraper, coding up the CreateSearchUrl and GetSearchResults sections? I will then try scraping the info fields.
PS. On a different topic: if a VIDEO_TS folder sits inside a folder named after the film, that folder name can be used for IMDB scraping. However, as mentioned, not all DVDs are listed on IMDB. It should be possible to use an NFO file to provide at least some metadata, but I am unsure of the correct naming and placement in this scenario.
e.g. /DVDs/Soylent Green/VIDEO_TS/
What should the NFO file be called, and in which of the three possible folders (DVDs, Soylent Green, or VIDEO_TS) should it be placed?