2014-12-26, 10:33
The default regex for cleandatetime is really bad. To clarify some details for anyone searching for info:
CUtils::CleanStrings first pulls the <cleandatetime> regex that you can specify in advancedsettings.xml.
Only one regex string is allowed in that field.
The first capture group is taken as the title. The second capture group is taken as the year (and is passed to the scraper in buffer $$2). Any additional groups are discarded. If the regex doesn't match at all, the year is left empty and the entire file name string is passed on to the <cleanstrings> portion of name handling. If it does match, everything other than the first group (generally everything before the start of the year info) and the year is discarded.
I'll list a number of ways a year can show up in a file name, and show what happens with the default regex and with mine (shown below). Most of the films aren't real; I'm just listing different patterns.
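To make the group handling concrete, here's a rough Python sketch of the behaviour described above. It is not Kodi's code: `split_title_year` is a name I made up, and using Python's `re.search` in place of Kodi's own regex engine is an assumption. The pattern is the default <cleandatetime> regex as quoted in this post.

```python
import re

# Default <cleandatetime> pattern, copied from this post.
DEFAULT_PATTERN = (
    r"(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-1][0-9])"
    r"([ _\,\.\(\)\[\]\-][^0-9]|$)"
)


def split_title_year(name, pattern=DEFAULT_PATTERN):
    """Hypothetical helper: return (title, year) for a file name.

    Group 1 is the title, group 2 is the year, and any further
    groups are ignored. With no match, the whole name is kept and
    the year is left empty.
    """
    match = re.search(pattern, name)
    if match is None:
        return name, ""
    return match.group(1), match.group(2)


for sample in ("My Movie 2004", "My Movie (2004)"):
    print(sample, "->", split_title_year(sample))
```

The second sample comes back unchanged with an empty year, which is the 'no match' case.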
'no match' means it will use the entirety of the provided file name, and not provide any year. Otherwise, I will show the captured title, then a slash, then the year that was determined.
My Movie
- default: no match
- mine: no match
My Movie 2004
- default: My Movie / 2004
- mine: My Movie / 2004
My Movie (2004)
- default: no match
- mine: My Movie / 2004
My_Movie_2004
- default: My_Movie / 2004
- mine: My_Movie / 2004
My Movie[2004]
- default: no match
- mine: My Movie / 2004
My TV Show (2004-2005)
- default: no match
- mine: My TV Show / 2004
My TV Show ( 2004 - 2005 )
- default: My TV Show ( 2004 / 2005
- mine: My TV Show / 2004
2001: A Space Odyssey
- default: no match
- mine: no match
2001: A Space Odyssey (1968)
- default: no match
- mine: 2001: A Space Odyssey / 1968
Knives: 2000 Ways to Kill Someone
- default: Knives: / 2000
- mine: no match
Knives: 2000 Ways to Kill Someone.2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001
Knives: 2000 Ways to Kill Someone-2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001
Knives: 2000 Ways to Kill Someone[2001]
- default: Knives: / 2000
- mine: Knives: 2000 Ways to Kill Someone / 2001
1999.S00E01
- default: no match
- mine: no match
1999.S00E01.1974
- default: 1999.S00E01 / 1974
- mine: 1999.S00E01 / 1974
1999 - S00E01 (1974)
- default: no match
- mine: 1999 - S00E01 / 1974
Umika - Sincerity [AKROSS_Con_2012]
- default: no match
- mine: Umika - Sincerity / 2012
Oasis - Falling Down (East of the Eden version)[2008][h264]
- default: Oasis - Falling Down (East of the Eden version / 2008
- mine: Oasis - Falling Down (East of the Eden version) / 2008
The 1975 Show (1975)
- default: The / 1975
- mine: The 1975 Show / 1975
The Tonight Show of 1995 (1995)
- default: The Tonight Show of / 1995
- mine: The Tonight Show of / 1995
As you can see, there are quite a few patterns that are just broken using the default regex.
The following is the regex I've built up to handle as many cases as feasible, based on the testing I've been able to do. It handles everything I've thrown at it except for that last pattern, and I'm not sure there's any reasonable way to deal with that short of completely disallowing years that are only preceded by spaces. I wouldn't object to that, but since the default allows plain spaces as delimiters, mine does too.
Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[ _.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>
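For anyone who wants to poke at this pattern outside Kodi, here's a small Python harness. It assumes Python's `re.search` behaves closely enough to Kodi's regex engine for these inputs, which I haven't checked against Kodi itself, so treat the results as indicative only. `MINE` and `clean` are names made up for the sketch. The only change to the pattern is escaping the `[` inside the first character class, which Python would otherwise warn about.

```python
import re

# The regex above, split over several lines for readability.
# Only change: "[" is escaped inside the first character class.
MINE = (
    r"(.+?)"
    r"(?:\s*(?:(?:[\[({])(?:[^])}]*)(?:_|\b))|[ _.,-]\s*)"
    r"((?:19|20)\d{2})"
    r"(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?"
    r"\b(?!(?:\s*\w)+)[^\\/]*?$"
)


def clean(name, pattern=MINE):
    """Return (title, year), or None when the pattern does not match."""
    match = re.search(pattern, name)
    if match is None:
        return None
    return match.group(1), match.group(2)


samples = [
    "My Movie (2004)",
    "2001: A Space Odyssey (1968)",
    "Knives: 2000 Ways to Kill Someone",
    "Knives: 2000 Ways to Kill Someone[2001]",
    "The 1975 Show (1975)",
    "The Tonight Show of 1995 (1995)",
]

for name in samples:
    result = clean(name)
    if result is None:
        print(f"{name}\n- no match")
    else:
        print(f"{name}\n- {result[0]} / {result[1]}")
```

The negative lookahead after the year is what rejects "2000 Ways": a year followed by more words is not treated as the release year.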
A version that doesn't allow simple spaces to be a delimiter for a year:
Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[_.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>
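Here's a quick way to see what dropping the space delimiter changes, again as a Python sketch under the same assumption that `re.search` is a fair stand-in for Kodi's matcher. `NO_SPACE` and `clean_strict` are made-up names, and the `[` in the first character class is escaped to keep Python quiet.

```python
import re

# The no-space variant above; the delimiter class is [_.,-]
# instead of [ _.,-]. "[" is escaped in the first character class.
NO_SPACE = (
    r"(.+?)"
    r"(?:\s*(?:(?:[\[({])(?:[^])}]*)(?:_|\b))|[_.,-]\s*)"
    r"((?:19|20)\d{2})"
    r"(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?"
    r"\b(?!(?:\s*\w)+)[^\\/]*?$"
)


def clean_strict(name):
    """Return (title, year), or None when the pattern does not match."""
    match = re.search(NO_SPACE, name)
    if match is None:
        return None
    return match.group(1), match.group(2)


for name in (
    "My Movie 2004",
    "My_Movie_2004",
    "The Tonight Show of 1995 (1995)",
):
    print(name, "->", clean_strict(name))
```

In this sketch the trade-off is visible in both directions: "The Tonight Show of 1995 (1995)" keeps its full title, but a plain "My Movie 2004" no longer yields a year at all.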
And for reference, here's the default regex:
Code:
<cleandatetime>(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-1][0-9])([ _\,\.\(\)\[\]\-][^0-9]|$)</cleandatetime>
Edit: Fixed the regex slightly.