filmstarts.de scraper development - help needed

filmstarts.de scraper development - help needed - floohh - 2008-05-28

Hi guys,

I'm currently developing a scraper for http://filmstarts.de. After hours of work I got XBMC to recognize the filmstarts search results, but when I try to fetch the details page, XBMC requests an empty URL. I think there is an error in my RegExp code.

And here we go:

The filmstarts HTML looks like this:

[HTML]
<li><a href="/kritiken/35848-Der-Fluch-von-Darkness-Falls.html">
<img alt="" src="http://thumbs.filmstarts.de/nano/DerFluchVonDarknessFalls_poster_1.jpg">
 <img src="/designs/default/images/ratings/310er.gif" alt="Wertung: 3 / 10">


Der Fluch von Darkness Falls
Teenie-Horror </a></li>

<li><a href="/kritiken/36232-Fluch-der-Karibik.html">
<img alt="" src="http://thumbs.filmstarts.de/nano/fluchderkaribik-poster1.jpg">
 <img src="/designs/default/images/ratings/910er.gif" alt="Wertung: 9 / 10">

Fluch der Karibik
Abenteuer </a></li>

<li><a href="/kritiken/37419-Blueberry-und-der-Fluch-der-D%E4monen.html">

<img alt="Blueberry und der Fluch der Dämonen" src="/designs/default//images/no_film_small.gif" height="44" width="30">
 <img src="/designs/default/images/ratings/610er.gif" alt="Wertung: 6 / 10">

Blueberry und der Fluch der Dämonen
Fantasy-Action </a></li>
[/HTML]

and my RegExp is:

Code:
<GetSearchResults dest="3">
  <RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="3">
    <RegExp input="$$1" output="<entity><title>\2</title><url>http://www.filmstarts.de/\1</url><id>\1</id></entity>" dest="5">
      <expression repeat="yes"><a href="/kritiken/([-.%\w]+)">[^<]|[\n]<span class="t">([-%. \w]+)</span></expression>
    </RegExp>
    <expression noclean="1"></expression>
  </RegExp>
</GetSearchResults>

(I know the entities aren't converted, but I decoded them for better understanding)

The most important line is:

Code:
<expression repeat="yes"><a href="/kritiken/([-.%\w]+)">[^<]|[\n]<span class="t">([-%. \w]+)</span></expression>

to recognize

[HTML]
<a href="/kritiken/35848-Der-Fluch-von-Darkness-Falls.html">
<img alt="" src="http://thumbs.filmstarts.de/nano/DerFluchVonDarknessFalls_poster_1.jpg">
 <img src="/designs/default/images/ratings/310er.gif" alt="Wertung: 3 / 10">


Der Fluch von Darkness Falls
[/HTML]

The only thing XBMC does is request "/".

Can somebody help me?
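
A note on the expression above: `|` has the lowest precedence in a regular expression, so it splits the whole pattern into two alternatives instead of choosing only between `[^<]` and `[\n]`. The Python sketch below illustrates this on a trimmed copy of the sample HTML. It assumes Python's `re` module treats alternation the same way as XBMC's regex engine, which this thread does not confirm; the names `sample`, `pattern` and `matches` are made up for illustration.

```python
import re

# Trimmed copy of the search-result HTML quoted in the post (first two entries).
sample = '''<li><a href="/kritiken/35848-Der-Fluch-von-Darkness-Falls.html">
<img alt="" src="http://thumbs.filmstarts.de/nano/DerFluchVonDarknessFalls_poster_1.jpg">
 <img src="/designs/default/images/ratings/310er.gif" alt="Wertung: 3 / 10">

Der Fluch von Darkness Falls
Teenie-Horror </a></li>
<li><a href="/kritiken/36232-Fluch-der-Karibik.html">
<img alt="" src="http://thumbs.filmstarts.de/nano/fluchderkaribik-poster1.jpg">
 <img src="/designs/default/images/ratings/910er.gif" alt="Wertung: 9 / 10">

Fluch der Karibik
Abenteuer </a></li>
'''

# The expression exactly as posted. The top-level "|" makes it read as:
#   branch A: <a href="/kritiken/(...)">[^<]
#   branch B: [\n]<span class="t">(...)</span>
pattern = re.compile(
    r'<a href="/kritiken/([-.%\w]+)">[^<]|[\n]<span class="t">([-%. \w]+)</span>'
)

# Each tuple is (group 1, group 2); a group that took no part in the match is ''.
matches = pattern.findall(sample)

for url_part, title in matches:
    print(repr(url_part), repr(title))
```

Only branch A ever matches here, so group 2 (used as the title via `\2`) stays empty for every result. Wrapping the alternatives in a group, for example `(?:[^<]|\n)*`, would keep the `|` local, provided the engine in use supports that syntax.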

- floohh - 2008-05-28

Okay, I altered the expression to skip the unnecessary text, but now I only catch the first match. Any idea how to solve this?

Code:
<li><a href="/kritiken/([-.a-z0-9A-Z]+)">.*<span class="t">([0-9a-zA-Z .]+).*</li>
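
One plausible reason for catching only a single match is that `.*` is greedy: it runs to the end of the input and backtracks to the last `<span class="t">`, so one match swallows the whole list. The Python sketch below compares the posted expression with a lazy variant (`.*?`). It rests on two assumptions that the thread does not confirm: that the result list arrives without line breaks, and that each title is wrapped in `<span class="t">`. The `html` string is a hypothetical sample, and whether XBMC's regex engine supports lazy quantifiers is also not established here.

```python
import re

# Hypothetical single-line result list; the <span class="t"> wrapper is an assumption.
html = (
    '<ul>'
    '<li><a href="/kritiken/35848-Der-Fluch-von-Darkness-Falls.html">'
    '<img alt="" src="a.jpg"><span class="t">Der Fluch von Darkness Falls</span>'
    ' Teenie-Horror</a></li>'
    '<li><a href="/kritiken/36232-Fluch-der-Karibik.html">'
    '<img alt="" src="b.jpg"><span class="t">Fluch der Karibik</span>'
    ' Abenteuer</a></li>'
    '</ul>'
)

# The expression as posted: both ".*" are greedy.
greedy = re.compile(
    r'<li><a href="/kritiken/([-.a-z0-9A-Z]+)">.*<span class="t">([0-9a-zA-Z .]+).*</li>'
)

# Same expression with lazy quantifiers, so each match stops at the nearest </li>.
lazy = re.compile(
    r'<li><a href="/kritiken/([-.a-z0-9A-Z]+)">.*?<span class="t">([0-9a-zA-Z .]+).*?</li>'
)

greedy_hits = greedy.findall(html)
lazy_hits = lazy.findall(html)

print(len(greedy_hits), greedy_hits)
print(len(lazy_hits), lazy_hits)
```

In this sketch the greedy version pairs the first URL with the last title, while the lazy version returns one (URL, title) pair per list item. If lazy quantifiers are not available, a negated character class such as `[^<]*` between fixed tags is a common substitute.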

- floohh - 2008-05-29

After hours of hard work, it finally worked.

- spiff - 2008-05-29

I assume your issue was that you didn't realize you are writing XML, so you need to escape special characters such as ", i.e. write &quot; instead.

Sorry, I didn't see your inquiry earlier. Feel free to ask again; I will try to help when I see it. :)
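
The escaping advice above can be checked outside XBMC. The Python sketch below builds the outer `RegExp` element from the first post twice, once with the `output` value pasted in raw and once escaped, and tests whether each version parses as XML. It uses only the standard library (`xml.sax.saxutils.escape` and `xml.etree.ElementTree`); the helper names `as_attribute` and `parses` are made up for this illustration, and the element is self-closed for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The output template from the first post, with entities decoded as shown there.
raw = r'<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>'


def as_attribute(value):
    """Escape &, < and > (the defaults) plus the double quote for attribute use."""
    return escape(value, {'"': "&quot;"})


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


broken = '<RegExp input="$$5" output="' + raw + '" dest="3"/>'
fixed = '<RegExp input="$$5" output="' + as_attribute(raw) + '" dest="3"/>'

print(parses(broken))  # the raw "<" and '"' inside the attribute break parsing
print(parses(fixed))
print(fixed)

# After parsing, the attribute value is the original, unescaped template again.
roundtrip = ET.fromstring(fixed).get("output")
```

The escaped form is what belongs in the scraper file; the parser hands the decoded template back, so `\1` and the inner tags arrive unchanged.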

- floohh - 2008-06-02

You can find the solution here:
Link

- spiff - 2008-06-02

Great, the more the merrier. I will add it to SVN. :)

cheers

- tatoosh - 2009-06-29

Hey,

I can't download your filmstarts.de scraper. Can you give me a link?
It would be nice to use this great website.

- w00dst0ck - 2009-06-29

@Tatoosh: Try updating your XBMC version.
Alternatively, you can download the current version from SVN via http://trac.xbmc.org.