Cameras dedicated to video monitoring usually do not allow people and vehicles to be identified in a city environment. The source of the problem is their relatively low angular resolution. They can be HD or even 4K, but if the object is far from the camera and the lens is wide-angle, recognition is impossible. It is possible to use high-megapixel cameras (e.g. 40 MP or more), but then the amount of data is huge and storing it becomes a problem.
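To make the angular resolution problem concrete, the sketch below estimates how many pixels cover one meter of the scene at a given distance. The 90 degree horizontal field of view, the 50 m distance and the sensor widths are illustrative assumptions, not measurements from a real camera, and lens distortion is ignored.

```python
import math

def pixels_per_meter(sensor_width_px, horizontal_fov_deg, distance_m):
    """Horizontal pixel density on a plane facing the camera at distance_m."""
    # Width of the scene covered by the sensor at that distance (pinhole model).
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return sensor_width_px / scene_width_m

if __name__ == "__main__":
    # Assumed widths: Full HD, 4K UHD, and a hypothetical ~40 MP sensor.
    for name, width in [("HD", 1920), ("4K", 3840), ("40 MP", 7680)]:
        density = pixels_per_meter(width, horizontal_fov_deg=90.0, distance_m=50.0)
        print(f"{name}: {density:.1f} px/m")
```

With the same lens and distance, the pixel density grows only in proportion to the sensor width, which is why a wide-angle HD camera leaves so few pixels on a distant face or license plate.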
What we propose is to use a megapixel camera in which only selected frames have native resolution, while the other frames have a lower resolution (HD or 4K, for example). The result is a video stream composed of frames with two different resolutions and two different frame rates. The amount of data is higher than for the lower resolution alone, but significantly smaller than for the higher resolution alone.
For example, we can use a 30 fps frame rate and a 30-frame GOP (Group of Pictures), where I frames have megapixel resolution while the other frames (B, P) have HD resolution. As a result, we have one megapixel frame and 29 HD frames every second.
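The data saving in this example can be estimated roughly by counting raw pixels per second. The sketch below assumes a hypothetical 7680 x 5200 sensor (about 40 MP) and takes "HD" to mean 1920 x 1080; both are assumptions. It compares uncompressed pixel counts only, so real bitrates will differ depending on content and encoder settings.

```python
NATIVE_W, NATIVE_H = 7680, 5200   # hypothetical ~40 MP sensor (assumption)
HD_W, HD_H = 1920, 1080           # "HD" taken as Full HD (assumption)
FPS = 30
GOP_LENGTH = 30

def pixels_per_second(fps, gop_length, i_pixels, other_pixels):
    """Raw pixels per second when each GOP has one I frame and the rest are P/B frames."""
    gops_per_second = fps / gop_length
    return gops_per_second * (i_pixels + (gop_length - 1) * other_pixels)

native = NATIVE_W * NATIVE_H
hd = HD_W * HD_H

dual = pixels_per_second(FPS, GOP_LENGTH, native, hd)          # proposed stream
hd_only = pixels_per_second(FPS, GOP_LENGTH, hd, hd)           # all frames HD
native_only = pixels_per_second(FPS, GOP_LENGTH, native, native)  # all frames native

if __name__ == "__main__":
    print(f"dual / HD only:     {dual / hd_only:.2f}x")
    print(f"dual / native only: {dual / native_only:.3f}x")
```

Under these assumptions the dual-resolution stream carries well under twice the raw pixels of an HD-only stream and less than a tenth of a native-only stream.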
The encoder encodes I frames at the native sensor resolution, then scales them down to HD resolution and uses them as references to encode B and P frames at HD resolution. The stream is composed of megapixel-resolution I frames and HD-resolution B and P frames. HD-resolution I frames are not added to the stream because they can be obtained on the decoder side by scaling down the megapixel-resolution I frames.
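The frame layout produced by such an encoder can be sketched as a simple plan that records, for every frame, its type, the resolution written to the stream, and the resolution used for prediction. The names `PlannedFrame` and `plan_frames` and the sensor size are hypothetical, and B frames are left out to keep the sketch short; this is an illustration of the layout, not an actual codec modification.

```python
from dataclasses import dataclass

NATIVE = (7680, 5200)  # hypothetical ~40 MP sensor size (assumption)
HD = (1920, 1080)      # "HD" taken as Full HD (assumption)

@dataclass(frozen=True)
class PlannedFrame:
    index: int
    kind: str            # "I" or "P" (B frames omitted for simplicity)
    coded_size: tuple    # resolution of the frame as written to the stream
    reference_size: tuple  # resolution at which the frame takes part in prediction

def plan_frames(n_frames, gop_length=30):
    """Assign a type and resolutions to each frame of a dual-resolution stream."""
    plan = []
    for i in range(n_frames):
        if i % gop_length == 0:
            # I frame: coded at native resolution, scaled down to HD
            # before it is used as a reference for the following frames.
            plan.append(PlannedFrame(i, "I", NATIVE, HD))
        else:
            # Predicted frame: coded and referenced at HD resolution.
            plan.append(PlannedFrame(i, "P", HD, HD))
    return plan

if __name__ == "__main__":
    for frame in plan_frames(3):
        print(frame)
```

Every frame has an HD reference size, which reflects that all prediction happens at HD resolution even though the I frame itself is stored at native resolution.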
The image shows the proportion between I frames at the native resolution of a 40 MP sensor and P and B frames at HD resolution.
As a result, we have one video stream that contains two different video substreams. One substream has megapixel resolution and a 1 fps frame rate; the other has HD resolution and a 30 fps frame rate.
A special, dedicated decoder is necessary to decode such a stream. The decoder can decode both substreams or only the megapixel one. The higher-resolution substream is composed of I frames only, while the lower-resolution substream is composed of I frames scaled down to HD resolution together with B and P frames, all at HD resolution.
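The separation into two substreams on the decoder side can be sketched as follows. Frames are represented as plain dictionaries with a `kind` and a `size`; this representation and the function name `split_substreams` are hypothetical, and the actual scaling of pixel data is only marked with a flag rather than performed.

```python
HD = (1920, 1080)      # "HD" taken as Full HD (assumption)
NATIVE = (7680, 5200)  # hypothetical ~40 MP sensor size (assumption)

def split_substreams(frames, hd_size=HD):
    """Return (high_res_substream, low_res_substream) for a dual-resolution stream."""
    # High-resolution substream: the native-resolution I frames only.
    high = [f for f in frames if f["kind"] == "I"]
    # Low-resolution substream: every frame, with I frames scaled down to HD.
    low = []
    for f in frames:
        if f["kind"] == "I":
            low.append({**f, "size": hd_size, "scaled_down": True})
        else:
            low.append({**f, "scaled_down": False})
    return high, low

if __name__ == "__main__":
    # One 30-frame GOP: a native I frame followed by 29 HD P frames.
    gop = [{"index": 0, "kind": "I", "size": NATIVE}]
    gop += [{"index": i, "kind": "P", "size": HD} for i in range(1, 30)]
    high, low = split_substreams(gop)
    print(len(high), "high-resolution frame(s),", len(low), "HD frame(s)")
```

A decoder that only needs the megapixel substream can stop after collecting the I frames and skip the predicted frames entirely.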
Example view from the camera (40 megapixels). The colored frames show two selected HD-resolution windows.
Content of the red frame. One pixel of the window corresponds to one pixel of the display.
Content of the green frame. One pixel of the window corresponds to one pixel of the display.
As the pictures above show, a 40-megapixel frame contains enough information to identify people and vehicles. The frame rate of the megapixel frames is low, however, and does not allow the dynamics of events to be judged, for example to decide who did what. The HD stream, on the other hand, has a high frame rate, which allows the activity of people and vehicles to be recognized, but its resolution is too low to identify them. Together, the two substreams make it possible not only to see what happened and who is responsible, but also to identify all persons and vehicles.
The idea is simple, and so is the implementation. Only minor changes to the codec (H.264, for example) are necessary to encode and decode dual-resolution streams.
More work is necessary to build a hardware implementation. A hardware decoder should be equipped with multiple video outputs (e.g. HDMI). One should output the whole stream scaled to the resolution of the connected monitor. The others should output the content of windows selected by the system operator (e.g. the red and green windows from the example above).
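The two kinds of output described above can be sketched in software: one helper computes the size of the whole frame when fitted to a monitor, and another cuts an operator-selected window out of a native-resolution frame at one window pixel per display pixel. The function names, the nested-list frame representation and the sensor size are assumptions made for illustration; a hardware decoder would do this on its own buffers.

```python
def fit_size(src_size, monitor_size):
    """Largest size with the source aspect ratio that fits inside the monitor."""
    src_w, src_h = src_size
    mon_w, mon_h = monitor_size
    if src_w * mon_h <= mon_w * src_h:
        # Height is the limiting dimension.
        return (src_w * mon_h // src_h, mon_h)
    # Width is the limiting dimension.
    return (mon_w, src_h * mon_w // src_w)

def crop_window(frame, x, y, width, height):
    """Cut a window out of a frame stored as a list of pixel rows (1:1 pixels)."""
    frame_h = len(frame)
    frame_w = len(frame[0]) if frame_h else 0
    if x < 0 or y < 0 or x + width > frame_w or y + height > frame_h:
        raise ValueError("window does not fit inside the frame")
    return [row[x:x + width] for row in frame[y:y + height]]

if __name__ == "__main__":
    # Whole-stream output: hypothetical 7680 x 5200 frame on a Full HD monitor.
    print(fit_size((7680, 5200), (1920, 1080)))
    # Window output: tiny 8 x 6 test frame whose pixel value encodes its position.
    frame = [[row * 10 + col for col in range(8)] for row in range(6)]
    print(crop_window(frame, x=2, y=1, width=3, height=2))
```

Because the window is cropped rather than scaled, it keeps the full detail of the native-resolution I frame.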