2009-09-23, 06:53
giving scraper creation a go
site: getvideoartwork.com
the only way i can see to parse the data is to pick the url to grab by the first char in the title, while removing "the " from it (if it's there).
(?:the )?(.)(.*)
\1 is the first char
\2 is the rest of the name
how do i form a search url based off \1
ie.. if \1 is [0-9] results to grab are one page, if [aA] it's another
[0-9] http://getvideoartwork.com/index.php?act..._itemId=38
[aA] http://getvideoartwork.com/index.php?act..._itemId=39
[fF] http://getvideoartwork.com/index.php?act...temId=2174
it's not really a search, just pulling the page that lists all the movies starting with a number (or char)
another question, directly related: if i want it to pull more than 1 url, how does that work
ie.
[dD] has 2 pages of data
http://getvideoartwork.com/index.php?act...&g2_page=1
http://getvideoartwork.com/index.php?act...&g2_page=2
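a rough python sketch of the multi-page idea, just to make the question concrete. it assumes each extra page is reached by adding g2_page=N to the letter's listing url (the urls above are truncated, so that is a guess), and that the page count is already known. `page_urls` is a made-up helper name, not part of any scraper api.

```python
from typing import List


def page_urls(base_url: str, page_count: int) -> List[str]:
    """Return one URL per results page, numbered from 1.

    Assumption: the site accepts a g2_page query parameter appended
    to the listing URL, as the two [dD] links suggest.
    """
    # Use "&" if the URL already has a query string, otherwise start one.
    separator = "&" if "?" in base_url else "?"
    return [f"{base_url}{separator}g2_page={n}" for n in range(1, page_count + 1)]
```

each url in the returned list would then be fetched and parsed the same way as a single-page letter.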
here's sorta the pseudo logic
Code:
getvideoartwork.com
Strip out "the"
match first char and search specific page based on first char
(?:the )?(.)(.*)
\1 is first char (not part of "the " and not a space)
\1\2 is the title
dynamic list (parse this page for each letter)
http://getvideoartwork.com/index.php?action=gallery&g2_itemId=27
regex repeat
fixed list
[0-9] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=38
[aA] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=39
[bB] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=40
[cC] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=80
[dD] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=82
[eE] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=84
[fF] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=2174
[gG] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=88
[hH] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=663
[iI] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=92
[jJ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=125
[kK] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=127
[lL] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=129
[mM] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=131
[nN] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=133
[oO] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=147
[pP] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=149
[qQ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=151
[rR] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=153
[sS] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=155
[tT] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=157
[uU] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=159
[vV] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=161
[wW] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=163
[xX] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=165
[yY] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=167
[zZ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=169
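the fixed list above boils down to a lookup table keyed on the first character. here is a small python sketch of that step, outside of any scraper xml. the regex and the item ids are copied from this post; the names `ITEM_IDS` and `listing_url` are made up for illustration, and the case-insensitive match on "the " is an assumption about how titles come in.

```python
import re
from typing import Optional

BASE = "http://getvideoartwork.com/index.php?action=gallery&g2_itemId="

# Item ids taken from the fixed list; "0-9" covers titles starting with a digit.
ITEM_IDS = {
    "0-9": 38, "a": 39, "b": 40, "c": 80, "d": 82, "e": 84, "f": 2174,
    "g": 88, "h": 663, "i": 92, "j": 125, "k": 127, "l": 129, "m": 131,
    "n": 133, "o": 147, "p": 149, "q": 151, "r": 153, "s": 155, "t": 157,
    "u": 159, "v": 161, "w": 163, "x": 165, "y": 167, "z": 169,
}

# Optional leading "the ", then the first char (\1) and the rest (\2).
TITLE_RE = re.compile(r"(?:the )?(.)(.*)", re.IGNORECASE)


def listing_url(title: str) -> Optional[str]:
    """Return the listing page for a title, or None if nothing matches."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    first = match.group(1).lower()
    key = "0-9" if "0" <= first <= "9" else first
    item_id = ITEM_IDS.get(key)
    return None if item_id is None else f"{BASE}{item_id}"
```

with this, "The Dark Knight" lands on the [dD] page because the leading "the " is skipped before the first char is taken.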
from that page get the info in
get all <tr> .*? </tr>
<td>(.*?)</td>
\1 is the blocks of data to process
(regex repeat)
then process those
ID: get id from g2_itemId: <a href="index.php?action=gallery&g2_itemId=3570">
<a href="index\.php\?action=gallery&g2_itemId=(\d{4,7})">
\1 is the id
TITLE:
<p class="giTitle">([^<]*)</p>
\1 is the title on the page
Title: giTitle <p> (paragraph) tag
<p class="giTitle">.*?</p>
or
<p class="giTitle">[^<]*</p>
regex repeat
form the image link as
http://getvideoartwork.com/index.php?action=gallery&g2_itemId=\1&g2_imageViewsIndex=1
i.e. http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1
Done with initial searching, that's the results list
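to check the regexes in this first stage, here is a python sketch that pulls each <td> block, reads the id and the giTitle, and forms the image page link. `parse_listing` and `SAMPLE_HTML` are made up for this example, and the sample markup is only a guess at what a listing cell looks like. the patterns also tolerate `&amp;` and attributes on <td>, which is an assumption about the real page source.

```python
import re
from typing import Dict, List

# One block per table cell; DOTALL lets a cell span several lines.
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)
ID_RE = re.compile(r'<a href="index\.php\?action=gallery&(?:amp;)?g2_itemId=(\d{4,7})">')
TITLE_RE = re.compile(r'<p class="giTitle">([^<]*)</p>')

IMAGE_PAGE = ("http://getvideoartwork.com/index.php"
              "?action=gallery&g2_itemId={item_id}&g2_imageViewsIndex=1")


def parse_listing(html: str) -> List[Dict[str, str]]:
    """Return id, title and image-page url for every cell that has both."""
    results = []
    for cell in CELL_RE.findall(html):
        id_match = ID_RE.search(cell)
        title_match = TITLE_RE.search(cell)
        if id_match and title_match:
            item_id = id_match.group(1)
            results.append({
                "id": item_id,
                "title": title_match.group(1).strip(),
                "url": IMAGE_PAGE.format(item_id=item_id),
            })
    return results


# Hypothetical cell markup, only for trying the patterns out.
SAMPLE_HTML = """<tr><td>
<a href="index.php?action=gallery&g2_itemId=3989"><img src="thumb.jpg"></a>
<p class="giTitle">Dark Knight v2</p>
</td></tr>"""
```

cells without both an id link and a giTitle are skipped, so layout-only cells do not end up in the results list.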
=======================
(page for reference) http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1
take those results
and get the link (includes some serial number thing)
<div id="gsImageView" class="gbBlock">
<img src="gallery/main.php?g2_view=core.DownloadItem&g2_itemId=3989&g2_serialNumber=1" alt="Dark Knight v2.jpg" height="1500" width="1000">
</div>
<img src="(gallery/main\.php\?g2_view=core\.DownloadItem&g2_itemId=\d{4,7}&g2_serialNumber=\d{1,9})"
append http://getvideoartwork.com/ to img src for url of image
url = http://getvideoartwork.com/\1
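and the second stage as a python sketch: find the img src inside the image page and prepend the site root. the sample markup is the gsImageView block quoted above; `image_url` is a made-up name, and accepting `&amp;` in the src is an assumption about the raw page source.

```python
import re
from typing import Optional

SITE = "http://getvideoartwork.com/"

# Capture the relative download link, including the serial number.
IMG_RE = re.compile(
    r'<img src="(gallery/main\.php\?g2_view=core\.DownloadItem'
    r'&(?:amp;)?g2_itemId=\d{4,7}&(?:amp;)?g2_serialNumber=\d{1,9})"'
)


def image_url(html: str) -> Optional[str]:
    """Return the absolute image url, or None if no download link is found."""
    match = IMG_RE.search(html)
    if match is None:
        return None
    # Undo HTML escaping of "&" in case the source uses &amp;.
    return SITE + match.group(1).replace("&amp;", "&")


SAMPLE_PAGE = """<div id="gsImageView" class="gbBlock">
<img src="gallery/main.php?g2_view=core.DownloadItem&g2_itemId=3989&g2_serialNumber=1" alt="Dark Knight v2.jpg" height="1500" width="1000">
</div>"""
```

calling `image_url(SAMPLE_PAGE)` gives the relative src with http://getvideoartwork.com/ in front, which is the url the scraper would hand back.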