Help needed with development of new scraper for filmdelta.se (Swedish Movie Scraper)?

- Daniel Malmgren - 2009-07-28

spiff Wrote:hard/impossible to say without seeing what you did

Quite understandable. That's why I put my code in pastebin. Too bad I forgot to put the link here.

Here goes: http://pastebin.com/m5c9b93da

/Daniel

- spiff - 2009-07-28

not your fault. seems the scraper editor inserts newlines into the scraper. i stripped the xml code for newline (#10's, can't write it properly as the forum eats them) and it works just fine.

nicezia; you listening?
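
A minimal sketch of the kind of cleanup spiff describes, assuming the stray newlines show up in the scraper XML as numeric character references (`&#10;` or `&#xA;`). The function name `strip_newline_refs` is hypothetical and is not part of XBMC or the ScraperXML Editor.

```python
import re

def strip_newline_refs(scraper_xml: str) -> str:
    """Remove numeric character references for a line feed from scraper XML."""
    # Matches the decimal form (&#10;) and the hex form (&#xA;), allowing leading zeros.
    # Other references such as &#100; are left alone.
    return re.sub(r"&#(?:0*10|[xX]0*[aA]);", "", scraper_xml)

print(strip_newline_refs("<expression>foo&#10;bar</expression>"))
```

Only the character references are removed here; literal line breaks between tags are left in place.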

- Nicezia - 2009-07-29

Don't know why it does that, it really shouldn't, probably a linq thing. I'm switching to a new XML parser soon (trying to write a wrapper for tinyxml as i'm not very fond of linq anymore; no matter how easy it makes handling xml, there just aren't enough options to control the way it outputs info).

in the meantime will update to fix soon

- Daniel Malmgren - 2009-07-29

Hi.
I've done some thinking. Filmdelta obviously doesn't support searches including the year (for example, searching for "Terminator" gives me the correct movie whilst searching for "Terminator (1984)" doesn't). Because of this, all previous versions of my scraper just threw away everything after the first "(" in the search string, which kinda seems like a waste of information.

Now I've fiddled around for a bit and I'm thinking of the following solution (which I've implemented in the xml below):

1. CreateSearchUrl saves the year into buffer 9 (and doesn't empty the buffers upon completion)

2. The last regexp in GetSearchResults (ie the one that parses the list of hits) only displays the hits that contain the correct year.

This gives a very much narrowed down list of hits. My only problem is that I don't know if this is a good idea. Does it have any side effects?
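
The two steps above can be sketched outside the scraper XML as follows. This is only an illustration of the idea, not Daniel's actual regexps: the function names, the `(title, year)` shape of a hit, and the sample list are all assumptions made up for the example.

```python
import re
from typing import List, Optional, Tuple

Hit = Tuple[str, str]  # (title, year) pair; a hypothetical shape for one search hit

def split_title_and_year(query: str) -> Tuple[str, Optional[str]]:
    """Split 'Terminator (1984)' into ('Terminator', '1984')."""
    match = re.match(r"^(.*?)\s*\((\d{4})\)\s*$", query)
    if match:
        return match.group(1), match.group(2)
    # No trailing "(yyyy)" found: search on the whole string, no year to filter on.
    return query.strip(), None

def filter_hits_by_year(hits: List[Hit], year: Optional[str]) -> List[Hit]:
    """Keep only hits from the wanted year; keep everything if no year is known."""
    if year is None:
        return list(hits)
    return [hit for hit in hits if hit[1] == year]

# Made-up sample hits, not real filmdelta output.
sample_hits = [("Terminator", "1984"), ("Terminator 2", "1991")]
title, year = split_title_and_year("Terminator (1984)")
print(title, year, filter_hits_by_year(sample_hits, year))
```

One side effect that this sketch makes visible: if the year in the search string does not match the year the site lists, the filtered list comes back empty.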

XML follows:

http://pastebin.com/m24448ce7

/Daniel

edit: Oh. About the newlines. Maybe I should mention that I always run my xml through a "xmllint --format" before copying them to pastebin (since the editor just outputs a big block of text without any newlines). Don't know if this is relevant here though...

edit2: Ok. One side effect seems to be that if I, for example, search for "Sagan om ringen (2001)", which should put "2001" into buffer 9, the buffer ends up containing "20" in xbmc. Can't really understand why...

edit3: Hmmm... Even when searching for "Robin Hood (1973)" buffer 9 ends up containing "20". Guess it just isn't a good idea to keep stuff in buffers between functions?

- Nicezia - 2009-07-31

Daniel Malmgren Wrote:Oh. About the newlines. Maybe I should mention that I always run my xml through a "xmllint --format" before copying them to pastebin (since the editor just outputs a big block of text without any newlines). Don't know if this is relevant here though...

ah i knew it wasn't outputting newlines. i tried for the last few days to reproduce the problem and it never had any newlines (since the code has Disable Formatting enabled, it should all come out without any indentation or newlines at all), at least til i figure out some standard of indentation and some way to implement it, since linq is pretty inflexible in this respect. thinking about just using xml text writer to write the final output, as i can specify indentation and every little aspect of its output, and tinyxml's just a little bit over my head.

- Nicezia - 2009-07-31

Daniel Malmgren Wrote:1. CreateSearchUrl saves the year into buffer 9 (and doesn't empty the buffers upon completion)
...........
edit2: Ok. One side effect seems to be that if I, for example, search for "Sagan om ringen (2001)", which should put "2001" into buffer 9, the buffer ends up containing "20" in xbmc. Can't really understand why...

the year goes into buffer $$2 when running CreateSearchUrl not $$1

Code:
createsearchurl: for movie process $$1 = urlencoded title, $$2 = year (which is where the new urlencode button on ScraperXML Editor comes in handy)

getsearchresults: buffers for movie process $$1 = html

getdetails: for movies $$1 = the html, $$2 = id, $$3 = the url for the html

(the tutorial online currently doesn't tell the whole story, in fact i'm not telling the whole story right now either as multiple urls change the whole story, but i am not a big fan of the multiple url usage, and don't want to promote it ;) )

what you are pulling out is the 20 from %20 (which is the url encoding on the Title string)
if you want to copy the year just do a

Code:
<RegExp input="$$2" output="\1" dest="9">
     <expression>(.+)</expression>
</RegExp>
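
Why the buffer ends up holding "20" can be reproduced in a few lines. This sketch assumes the scraper's year regexp was effectively looking for a run of digits in `$$1`; the pattern `[0-9]+` and the helper name `first_digit_run` are stand-ins, not the actual expression from the pastebin.

```python
import re
from typing import Optional
from urllib.parse import quote

def first_digit_run(text: str) -> Optional[str]:
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"[0-9]+", text)
    return match.group(0) if match else None

# $$1 holds the URL-encoded title, so every space has become "%20".
encoded_title = quote("Sagan om ringen")
print(encoded_title)                   # Sagan%20om%20ringen
print(first_digit_run(encoded_title))  # 20
```

Any title containing a space gives the same result, which matches the "Robin Hood (1973)" observation in the earlier post.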

- Daniel Malmgren - 2009-08-01

Nicezia Wrote:the year goes into buffer $$2 when running CreateSearchUrl not $$1

Code:
createsearchurl: for movie process $$1 = urlencoded title, $$2 = year (which is where the new urlencode button on ScraperXML Editor comes in handy)

getsearchresults: buffers for movie process $$1 = html

getdetails: for movies $$1 = the html, $$2 = id, $$3 = the url for the html

Oh. This was new to me. I thought $$1 was the only argument sent to any of the functions. So $$2 as sent to CreateSearchUrl actually is [^0-9]* from the title, right?

Well anyway, I'm not putting much effort into this right now. I've spoken to the folks behind filmdelta, and they're doing a special page for xbmc scraping, which will hopefully make our lives a lot easier. So maybe we don't need to save the year anywhere at all ;)

/Daniel

- Nicezia - 2009-08-01

The whole idea of the year is that it's easier to make an exact match with two correct terms to go by. Say you're doing an auto-update in XBMC (threaded): if you have year and title, it's more likely to pick the right one from the search results.
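
A rough sketch of that point, assuming an unattended update has to pick a single hit without asking the user. The function `pick_best_hit` and the `(title, year)` tuples are hypothetical and do not reflect how XBMC itself ranks results; the hit list is an illustrative example, not filmdelta output.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, str]  # (title, year); hypothetical shape for one search hit

def pick_best_hit(hits: List[Hit], title: str, year: Optional[str]) -> Optional[Hit]:
    """Pick one hit automatically, or return None when the choice is ambiguous."""
    wanted = title.casefold()
    title_matches = [hit for hit in hits if hit[0].casefold() == wanted]
    if year is not None:
        # Title and year together usually identify exactly one film.
        for hit in title_matches:
            if hit[1] == year:
                return hit
        return None
    # Title alone is only safe when it matches exactly one hit.
    return title_matches[0] if len(title_matches) == 1 else None

hits = [("Robin Hood", "1973"), ("Robin Hood", "1991"), ("Robin Hood", "2010")]
print(pick_best_hit(hits, "Robin Hood", "1973"))
print(pick_best_hit(hits, "Robin Hood", None))
```

With the year, the first call settles on one film; without it, the three same-titled hits cannot be told apart and the function declines to choose.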

- Daniel Malmgren - 2009-08-01

I don't know when filmdelta are going to do their special page, so I guess I won't wait for it. I've fixed everything now so the scraper runs perfectly using the normal filmdelta pages. When (if) they do their stuff I'll take it from there. Which I guess means that this scraper is finished for now, supposing nobody here has any unexpected problems with it.

What happens next? Anyone willing to commit the scraper to svn? Spiff?

Latest scraper version

I've got a png file here for the scraper, guess it needs to follow the xml into svn...

/Daniel

- spiff - 2009-08-01

trac it please. xbmc.org/trac (for bookkeeping among other things :) )

- Daniel Malmgren - 2009-08-01

spiff Wrote:trac it please. xbmc.org/trac (for bookkeeping among other things )

http://trac.xbmc.org/ticket/6992

/Daniel

- mkortstiege - 2009-08-02

Added to SVN, thanks!

- Daniel Malmgren - 2009-08-02

vdrfan Wrote:Added to SVN, thanks!

Thank you!

In the future, if I make enhancements of the scraper, do I reopen the trac ticket or simply notify you (or someone else with svn commit rights) through the forum?

/Daniel

- mkortstiege - 2009-08-02

Always create a new ticket on trac and use the very latest SVN version of the scraper.

- Daniel Malmgren - 2009-08-05

vdrfan Wrote:Added to SVN, thanks!

Darn. Just realized that you sabotaged the scraper before committing it. The stuff in the last regexp in GetSearchResults that was supposed to filter out movies from the wrong year (the comparison with buffer 9) is gone from the version in svn. Why is that? Now I get a huge list of films from the wrong year...

:o

/Daniel