Greetings.
I also have many documentaries and was interested in having a scraper for them. Regexp is somewhat familiar to me, so whacking together the main parser script wasn't too hard. There are a few issues that I don't understand, though.
Currently, scrap is reporting that my scraper returns the following for
http://docuwiki.net/?title=Battleplan
Code:
<details><title>Battleplan</title>
<year>2006</year>
<plot><p>Battleplan is a military-based television documentary series examing the various military strategies used in modern warfare,
</p><p>since World War I. It is shown on the Military Channel and UKTV History.
</p><p>Each episode looks at a particular military strategy (or "battleplan") used in warfare, through two well-known historical
</p><p>examples and compares them both with the military requirements needed in order to conduct that "Battleplan". All the episodes
</p><p>use examples from modern warfare, dating from the First World War (1914–18) up to the recent Iraq War (2003).
</p></plot>
<actor><role>hosted</role><name>Eric Meyers</name></actor>
<genre>War</genre>
<episodeguide><episode><title>Blitzkrieg</title><season>1</season><epnum>1</epnum><id>1</id><plot><p>Examples used: Nazi Germany Blitzkrieg Campaign, Battle of France, Second World War and 2003 invasion of Iraq,Iraq War.
</p></plot></episode>
<episode><title>Assault From The Air</title><season>1</season><epnum>2</epnum><id>2</id><plot><p>Battle of Crete, Unternehmen Merkur, Second World War and Operation Junction City , Vietnam War.
</p></plot></episode>
<episode><title>Deception</title><season>1</season><epnum>3</epnum><id>3</id><plot><p>Examples used: Battle of Normandy, D-Day, Second World War and First Gulf War and 2003 invasion of Iraq, Iraq War.
</p></plot></episode>
<episode><title>Assault From The Sea</title><season>1</season><epnum>4</epnum><id>4</id><plot><p>Examples used: Battle of Inchon, Korean War and Battle of Iwo Jima, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Counterstrike</title><season>1</season><epnum>5</epnum><id>5</id><plot><p>Examples used: Yom Kippur War and Battle of Moscow, Second World War.
</p></plot></episode>
<episode><title>Blockade</title><season>1</season><epnum>6</epnum><id>6</id><plot><p>Examples used: Second Battle of the Atlantic, Second World War and US Submarine Campaign 1943-45, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Siege</title><season>1</season><epnum>7</epnum><id>7</id><plot><p>Examples used: Siege of Leningrad, Second World War and Battle of Dien Bien Phu, First Indochina War and Battle of Khe Sanh, Vietnam War.
</p></plot></episode>
<episode><title>Battlefleet</title><season>1</season><epnum>8</epnum><id>8</id><plot><p>Examples used: Battle of Midway, Pacific War, Second World War and Battle of Leyte Gulf, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Pre-Emptive Strike</title><season>1</season><epnum>9</epnum><id>9</id><plot><p>Examples used: Six Day War and attack on Pearl Harbor, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Control of The Air</title><season>1</season><epnum>10</epnum><id>10</id><plot><p>Examples used: Battle of Britain, Second World War and First Gulf War.
</p></plot></episode>
<episode><title>Defensive Battle</title><season>1</season><epnum>11</epnum><id>11</id><plot><p>Examples used: Hindenburg Line, Western Front, First World War and Battle of Kursk, Second World War.
</p></plot></episode>
<episode><title>Guerilla Warfare</title><season>1</season><epnum>12</epnum><id>12</id><plot><p>Examples used: Mujahideen,Soviet war in Afghanistan and National Front for the Liberation of South Vietnam, a.k.a. Vietcong, Vietnam War.
</p></plot></episode>
<episode><title>Urban Warfare</title><season>1</season><epnum>13</epnum><id>13</id><plot><p>Examples used: Tet Offensive, Vietnam War and Battle of Stalingrad.
</p></plot></episode>
<episode><title>Breaking a Fortified Line</title><season>1</season><epnum>14</epnum><id>14</id><plot><p>Examples used: Hindenburg Line,Western Front, First World War and Second Battle of El Alamein, Second World War.
</p></plot></episode>
<episode><title>Raiding Operations</title><season>1</season><epnum>15</epnum><id>15</id><plot><p>Examples used: Unternehmen Eiche, recapture of Mussolini by Otto Skorzeny, Second World War and Operation Ivory Coast, Son Tay, Vietnam War.
</p></plot></episode>
<episode><title>Strategic Bombing</title><season>1</season><epnum>16</epnum><id>16</id><plot><p>Examples used: the RAF/USAAF campaign against Nazi Germany from 1941-45, bombing of Dresden and the USAAF assault on Japan in 1944-45, Bombing of Tokyo in World War II.
</p></plot></episode>
<episode><title>Flank Attack</title><season>1</season><epnum>17</epnum><id>17</id><plot><p>Examples used: Battle of Normandy, D-Day, Second World War and First Gulf War.
</p></plot></episode>
<episode><title>Special Operations</title><season>1</season><epnum>18</epnum><id>18</id><plot><p>Examples used: French Resistance, in the Second World War and 2003 invasion of Iraq, Iraq War.
</p></plot></episode>
</episodeguide>
</details>
Obviously the problem is that I have p tags all over the plots.
I've been able to strip the p tags out of the main plot with the following:
Code:
<RegExp input="$$4" output="\1" dest="6">
<expression noclean="1">((&lt;p&gt;[^&lt;]+&lt;/p&gt;)+)</expression>
</RegExp>
<RegExp input="$$6" output="&lt;plot&gt;\1&lt;/plot&gt;\n" dest="5+">
<expression></expression>
</RegExp>
But I can't figure out how to strip the p tags out of the episode plots. The episode section is generated with the following:
Code:
<RegExp input="$$3" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;\n" dest="5+">
<RegExp input="$$4" output="&lt;episode&gt;&lt;title&gt;\2&lt;/title&gt;&lt;season&gt;1&lt;/season&gt;&lt;epnum&gt;\1&lt;/epnum&gt;&lt;id&gt;\1&lt;/id&gt;&lt;plot&gt;\3&lt;/plot&gt;&lt;/episode&gt;\n" dest="3">
<expression repeat="yes" noclean="1">&lt;span class="mw-headline"&gt; ([0-9]+)\. ([-a-zA-Z0-9 ]+) &lt;/span&gt;&lt;/h3&gt;\n((&lt;p&gt;[^&lt;]+&lt;/p&gt;)+)</expression>
</RegExp>
<expression noclean="1" />
</RegExp>
$$4 contains all the HTML between the "Information" and "Screenshot" labels from the page. Does anyone more experienced with XBMC scrapers have an idea how to strip those p tags from the episode plots?
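The regex side of this can be tried outside XBMC before touching the scraper. Below is a minimal Python sketch, not scraper XML: it applies the same episode pattern as above to a made-up HTML fragment, then strips the p tags from each captured plot in a second pass. The SAMPLE text and the names EPISODE, strip_p and episodes are hypothetical, and how the second pass maps onto an extra RegExp step in the scraper is left open.

```python
import re

# Hypothetical fragment modelled on the page layout described above;
# it is NOT the real docuwiki.net HTML.
SAMPLE = (
    '<h3><span class="mw-headline"> 1. Blitzkrieg </span></h3>\n'
    '<p>Examples used: Battle of France.\n</p>'
    '<p>Second paragraph.\n</p>\n'
    '<h3><span class="mw-headline"> 2. Deception </span></h3>\n'
    '<p>Examples used: D-Day.\n</p>\n'
)

# Same pattern as the scraper expression, with the inner group made
# non-capturing so only number, title and plot block are captured.
EPISODE = re.compile(
    r'<span class="mw-headline"> ([0-9]+)\. ([-a-zA-Z0-9 ]+) </span></h3>\n'
    r'((?:<p>[^<]+</p>)+)'
)

P_TAG = re.compile(r'</?p>')


def strip_p(fragment):
    """Replace <p> and </p> with spaces, then collapse all whitespace."""
    return ' '.join(P_TAG.sub(' ', fragment).split())


def episodes(html):
    """Return one dict per episode heading found, with a tag-free plot."""
    return [
        {"epnum": int(num), "title": title, "plot": strip_p(plot)}
        for num, title, plot in EPISODE.findall(html)
    ]


if __name__ == "__main__":
    for ep in episodes(SAMPLE):
        print(ep)
```

The point of the sketch is the order of operations: capture the whole run of p blocks first, then remove the tags from that capture, rather than trying to do both in a single expression.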
The other problem with documentaries: currently I have this set up as a tvshow type, but documentaries are usually named either Name.quality-ripinfo.avi or Name.1of10.EpisodeName.quality-ripinfo.avi. How can I get XBMC to detect 1of10 as season 1 episode 1, 4of10 as season 1 episode 4, etc.? And for single-part documentaries, do I still need an episodeguide section with a single episode?
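To make the intended mapping concrete, here is a small Python sketch of a pattern that picks the "NofM" part out of such file names and treats N as the episode of a fixed season 1. It only illustrates the regex; the names PART and season_episode are hypothetical, and wiring a pattern like this into XBMC's own file name matching is a separate question that this does not answer.

```python
import re

# "NofM" delimited by a dot, space, underscore or dash (or string ends).
PART = re.compile(r'(?:^|[._ -])(\d+)of(\d+)(?=[._ -]|$)', re.IGNORECASE)


def season_episode(filename, season=1):
    """Return (season, episode) for names like Name.4of10.Title.avi.

    The season is an assumption (always 1 unless overridden), since the
    file name carries no season number. Returns None for single-part
    names that have no NofM token.
    """
    match = PART.search(filename)
    if match is None:
        return None
    return season, int(match.group(1))


if __name__ == "__main__":
    for name in (
        "Name.1of10.EpisodeName.quality-ripinfo.avi",
        "Name.4of10.EpisodeName.quality-ripinfo.avi",
        "Name.quality-ripinfo.avi",
    ):
        print(name, season_episode(name))
```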