Hey @manish_khurana,
Thanks a lot for your suggestion. We can take your approach into account.
In my view, your idea of building a machine learning model on audio and video features (I'm assuming video frames) could work well for detecting outros, but it might not work as well for intros.
So let's assume we build a simple machine learning model for intros. It would essentially be a 2-class classifier: Intro vs Not-Intro. For the Intro class, we take (video frames + audio) of one show's intro, convert it into some kind of feature vector, and add it to the training data; we repeat this for several shows. For the Not-Intro class, we take random samples of video and audio from those same shows. So we have (video frames + audio) across several shows for both classes, and we train a classifier on this data. In my opinion the classifier won't be able to distinguish the two sets well, since both are essentially video frames plus audio drawn from the same shows. Accumulating the Intro training set over many shows would eventually make it non-generalizable and indistinguishable from the negative set; at best the classifier would overfit the training data.
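Just to make that setup concrete, here's a minimal sketch of the pipeline I'm describing. The feature extraction and the synthetic stand-in data are purely my assumptions for illustration (real clips would come from the shows themselves); the point is that when the two classes' features overlap this much, cross-validation accuracy hovers around chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def extract_features(frames, audio):
    """Turn one clip into a fixed-length vector: a crude per-channel
    color average of the frames plus basic audio statistics. Real
    features (histograms, MFCCs, ...) would go here."""
    frame_feats = frames.mean(axis=(0, 1, 2))            # (3,) mean RGB
    audio_feats = np.array([audio.mean(), audio.std()])  # (2,) loudness stats
    return np.concatenate([frame_feats, audio_feats])

# Stand-in data: 100 "intro" clips and 100 random clips, each a stack of
# 30 RGB frames plus ~1 s of audio. With real clips, both classes come
# from the same shows, which is exactly why the features overlap so much.
clips = [(rng.random((30, 64, 64, 3)), rng.standard_normal(16000))
         for _ in range(200)]
X = np.stack([extract_features(f, a) for f, a in clips])
y = np.array([1] * 100 + [0] * 100)  # 1 = intro, 0 = not-intro

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5))  # ~0.5 when classes are indistinguishable
```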
Thinking about it further, including video frames in the intro training data wouldn't add much value: for both classes the frames would mostly be scenes from the show itself, and thus not very differentiable. If we take only the audio features into account, it might work, since the classifier might be able to tell apart songs (the intro sequence) from non-song audio (the rest). But this could also produce a lot of false positives wherever a song plays in scenes other than the intro (which does happen).
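If we did go audio-only, the features would probably be something like MFCC statistics. A rough sketch of such an extractor (librosa is just my assumed choice here, not something from either proposal):

```python
import numpy as np
import librosa

def audio_feature_vector(path, sr=22050):
    """Summarize a clip's audio as MFCC means + variances, a common
    compact representation for music/speech discrimination."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])  # (40,)
```

Any classifier trained on features like these would score high on music in general, which is exactly where the false positives would come from.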
And even if we assume that applying deep learning techniques to this data would eventually work things out, we'd still end up having to collect a huge amount of training data, since I don't know whether transfer learning from any of the existing audio or video architectures would carry over to this particular problem.
The reason I thought it would work for outros is that outros have distinguishable video frames (a black screen with rolling credits), so combined with audio the classifier might learn better.
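That visual distinctiveness can be captured even by a trivial heuristic; for example (an OpenCV sketch, with a made-up brightness threshold):

```python
import cv2

def dark_frame_ratio(video_path, brightness_threshold=40):
    """Fraction of frames whose mean grayscale brightness falls below
    the threshold; outro credit rolls on black should score much
    higher than ordinary scenes."""
    cap = cv2.VideoCapture(video_path)
    dark = total = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        dark += gray.mean() < brightness_threshold
        total += 1
    cap.release()
    return dark / total if total else 0.0
```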
I thought about the above problems with training a proper classifier on video and audio input, and that's why I didn't go with it. Instead I tried out the fingerprinting approach and found it fast and effective. I understand your concern about the database problem; we'll try to figure something out as the project progresses, in discussion with the mentors. Also, the approach in my proposal is a very rudimentary one and is bound to improve iteratively as we proceed.
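For reference, the core idea behind fingerprinting is roughly the following (a deliberately simplified sketch, not the exact scheme from my proposal; real fingerprinters like Shazam/Dejavu pair spectral peaks across time before hashing):

```python
import numpy as np
from scipy import signal

def fingerprint(audio, sr=22050, peaks_per_frame=3):
    """Very rough audio fingerprint: the strongest spectrogram peaks
    per time slice, reduced to frequency-bin indices. Matching a clip
    against a known intro then reduces to comparing hash sets instead
    of raw audio, which is why lookup is fast."""
    _f, _t, sxx = signal.spectrogram(audio, fs=sr, nperseg=1024)
    hashes = set()
    for i in range(sxx.shape[1]):
        top = np.argsort(sxx[:, i])[-peaks_per_frame:]  # loudest bins
        hashes.add(tuple(sorted(top)))
    return hashes

def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints as a crude match score."""
    return len(fp_a & fp_b) / max(1, len(fp_a | fp_b))
```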
If you have some positive results with your classifier approach, kindly share the evaluation protocol, the metrics you used, and the results you obtained. I'd be happy to try it out and adopt it if it works well. Thanks a lot for your suggestions.
Regards.