Clean scraping API - Printable Version

Re: Clean scraping API - queeup - 2013-04-11

Sorry to interrupt. I didn't read the whole topic, but maybe you guys want to check this out for some new ideas.
https://github.com/wackou/guessit

RE: Clean scraping API - topfs2 - 2013-04-11

(2013-04-11, 16:58)queeup Wrote: Sorry to interrupt. I didn't read the whole topic, but maybe you guys want to check this out for some new ideas.
https://github.com/wackou/guessit

Nice find! I bet there are tons of things we can borrow from that!

Re: Clean scraping API - queeup - 2013-04-11

Good, then I will add one more for video metadata.
https://github.com/Diaoul/enzyme

RE: Clean scraping API - garbear - 2013-04-11

stop making our lives easier

Re: Clean scraping API - queeup - 2013-04-11

Believe me, I have been waiting for this Python scraper thing for almost two years, and finally it's happening. Well done. The bad thing is that I only saw this topic today. Shame on me :(

RE: Clean scraping API - garbear - 2013-04-11

;)

Kickass links.

RE: Clean scraping API - topfs2 - 2013-04-17

Since this thread has gotten so much attention lately, I want to start a discussion on something I really need some input on :)

The discussion concerns issues #7 and #9; #8 is semi-related.

The problem is not really the scheduling algorithms (they would need some love but in essence they should work) but more how to reorganize the API of supplies and demands.

Basically what we arrive at IMO is a subgraph find and alteration problem, which we in essence had before but with a single node (subject) and its edge.

So what I envision is something along the lines of:
demands: find A where edge(A, owl.sameAs, B) and (B is URL or edge(B, dc.identifier))

This would allow for an owl.sameAs value like this one, which mixes a plain URL with a nested node:

Code:
{
  owl.sameAs: [
    "http://themoviedb.org/movie/544",
    {
      dc.identifier: [ "http://www.imdb.com/title/tt0372784" ],
      foaf.thumbnail: [ "http://www.imdb.com/media/rm955554048/tt0372784?ref_=tt_ov_i" ]
    }
  ]
}

But I can't find a nice, Pythonic way to express the above query in Python.
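
One possible shape for such a query, as a minimal sketch only: small matcher functions that compose, applied to a subject represented as a plain dict like the one above. The helper names (is_url, has_edge, any_of) are made up for illustration and are not part of Heimdall; the dict representation of a subject is also an assumption.

```python
from urllib.parse import urlparse


def is_url(node):
    """True if node is a string that parses as an http(s) URL."""
    if not isinstance(node, str):
        return False
    parts = urlparse(node)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def has_edge(predicate, target=None):
    """Build a matcher: node must be a dict carrying the edge `predicate`.

    If `target` is given, at least one object of that edge must satisfy it.
    """
    def match(node):
        if not isinstance(node, dict) or predicate not in node:
            return False
        return target is None or any(target(obj) for obj in node[predicate])
    return match


def any_of(*matchers):
    """Build a matcher that succeeds if any of the given matchers does."""
    return lambda node: any(m(node) for m in matchers)


# find A where edge(A, owl.sameAs, B) and (B is URL or edge(B, dc.identifier))
demand = has_edge("owl.sameAs", any_of(is_url, has_edge("dc.identifier")))

subject = {
    "owl.sameAs": [
        "http://themoviedb.org/movie/544",
        {
            "dc.identifier": ["http://www.imdb.com/title/tt0372784"],
            "foaf.thumbnail": [
                "http://www.imdb.com/media/rm955554048/tt0372784?ref_=tt_ov_i"
            ],
        },
    ]
}

print(demand(subject))               # True
print(demand({"owl.sameAs": [42]}))  # False: 42 is neither a URL nor a node
```

Because every matcher is just a callable taking a node, the same building blocks could in principle describe what a task supplies as well as what it demands.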

I'd love it if the demand and supply APIs were similar as well, and provided some validation of the output too.

ATM a task can state that it outputs a certain edge and nothing else, but when run it can output anything :)

This could potentially break scheduling. So if a task misbehaves, I'd love for Heimdall to be able to detect that and just throw away the result :)
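
The check described above could look roughly like this. It is a sketch under assumptions: Task, the supplies attribute and run_checked are hypothetical names, not Heimdall's actual API, and a task result is assumed to be a dict keyed by edge name.

```python
class Task:
    """Hypothetical base class: `supplies` lists the edges a task may emit."""
    supplies = frozenset()

    def run(self, subject):
        raise NotImplementedError


class ThumbnailTask(Task):
    supplies = frozenset({"foaf.thumbnail"})

    def run(self, subject):
        return {"foaf.thumbnail": ["http://example.com/thumb.jpg"]}


class MisbehavingTask(Task):
    supplies = frozenset({"foaf.thumbnail"})

    def run(self, subject):
        # Emits dc.title even though only foaf.thumbnail was declared.
        return {
            "foaf.thumbnail": ["http://example.com/thumb.jpg"],
            "dc.title": ["surprise"],
        }


def run_checked(task, subject):
    """Run the task; return its output, or None if it broke its contract."""
    result = task.run(subject)
    undeclared = set(result) - task.supplies
    if undeclared:
        # The scheduler planned around `supplies`, so discard the whole result.
        return None
    return result


print(run_checked(ThumbnailTask(), {}))    # {'foaf.thumbnail': ['http://example.com/thumb.jpg']}
print(run_checked(MisbehavingTask(), {}))  # None
```

Discarding the whole result is the strictest option; keeping only the declared edges would be a softer alternative, at the cost of hiding the misbehaviour.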

Cheers,
Tobias

RE: Clean scraping API - garbear - 2013-05-09

(2013-05-09, 19:49)The Movie Database Wrote: Searching is an important tool for a project like TMDb. Without a good search we end up with duplicates, frustrated users and quite frankly a less than stellar experience. Over the past few years we've had a lot of things change, especially with the amount of non-English content that has been added to our database. We've also grown a lot and our old search infrastructure simply wasn't up for the task.

Starting yesterday, we rolled out a completely brand new, built from scratch search that we feel very proud of. We're not saying it's going to be perfect but it's a foundation we can feel confident growing into.

Along with these improvements behind the scenes, we also added two new options to search with. 'primary_release_year' and 'search_type' are new. You can read about how these work by visiting our search documentation.

http://docs.themoviedb.apiary.io/#search

As always, if you notice any specific issues make sure to head over to our support area and let us know.

One last thing, we also released more than just a new search, as we have brought the idea behind our 2.1 "Movie.browse" method into v3 but made it considerably better. We've renamed it "discover" and it's pretty awesome. You can read more about it by visiting our API documentation.

http://docs.themoviedb.apiary.io/#discover

From their Facebook page: https://www.facebook.com/themoviedb

It looks like they've been working heavily on the search issue as well. With a search engine on their end so heavily optimized in the domain of movies, I'm imagining how much thinking we're going to need to put in to actually contribute anything statistically significant to their results.
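
For reference, a request using the new options might be built like this. This is only a sketch: the base URL and the api_key/query parameter names are assumptions about the v3 movie search endpoint and should be checked against the search documentation linked above; only primary_release_year and search_type are named in the announcement. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed v3 movie search endpoint; verify against the linked search docs.
SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(api_key, title, primary_release_year=None, search_type=None):
    """Build a search URL, adding the new options only when they are given."""
    params = {"api_key": api_key, "query": title}
    if primary_release_year is not None:
        params["primary_release_year"] = primary_release_year
    if search_type is not None:
        # Accepted values are described in the search documentation.
        params["search_type"] = search_type
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("YOUR_API_KEY", "Batman Begins", primary_release_year=2005))
```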

RE: Clean scraping API - TheMonkeyKing - 2013-10-18

Error results on our end. While they have the definitions developed on their end, we need the application of terms. Basically, we want to sort out false results and possibly fix an erroneous result, so that it is correct now and the corrected ID is remembered. We also need to know when to search and when not to.

(2013-05-09, 23:05)garbear Wrote: It looks like they've been working heavily on the search issue as well. With a search engine on their end so heavily optimized in the domain of movies, I'm imagining how much thinking we're going to need to put in to actually contribute anything statistically significant to their results.