CN113168706B - Object position determination in frames of a video stream
- Publication number: CN113168706B
- Application number: CN201880099950.0A
- Authority: CN (China)
- Legal status: Active
Classifications
- G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06V20/10: Terrestrial scenes
- G06T7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
- G06V20/40: Scenes; Scene-specific elements in video content
- G06T2207/10016: Video; Image sequence
Abstract
An object localization method comprises: for at least one frame of a video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, deciding whether the determination of the position of at least one object (11, 13, 15, 17, 19) in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object (11, 13, 15, 17, 19) detected in the reference frame.
Description
Technical Field
The present invention relates generally to a method, object locator, computer program product and user equipment for object position determination in frames of a video stream.
Background
Augmented Reality (AR) is a direct or indirect live view of a physical real-world environment whose objects are augmented (i.e., perceptually enriched) by computer-generated perceptual information. The superimposed perceptual information may be constructive, i.e. added to the natural environment, or destructive, i.e. masking parts of the natural environment.
More and more AR applications for user devices (e.g., smartphones and tablet computers) have been developed to superimpose virtual objects on a real world view. The core technical challenges in these applications are:
1) Identifying real-world objects and their locations on the screen, commonly denoted in the art as Object Detection (OD) or object identification;
2) Tracking an object of interest, commonly referred to in the art as Object Tracking (OT); and
3) Enhancing the scene with artificial objects, labels, or other types of perceptual information.
Previously, some of the best solutions in the field of object detection were considered to be based on Deformable Part Models (DPMs) with Histogram of Oriented Gradients (HOG) features. In the past few years, more accurate solutions based on Convolutional Neural Network (CNN) technology have come to be considered the state of the art in object detection. These solutions detect objects in a given frame or picture of a video stream, but require a significant amount of processing power to operate in real time. Thus, CNNs typically run on servers equipped with modern Graphics Processing Units (GPUs) with large amounts of memory.
In some AR applications, object detection needs to run in real-time on a portable user device. A typical example is an industrial AR application, which may be a support tool for technicians to repair complex hardware systems, for example. Then, the portable user device (e.g. in the form of a handheld device or a head mounted device) comprises a camera for capturing video that is input to the object detection. If the camera of such a portable user device changes its position, object detection needs to run in almost every frame of the video stream in order to find the position of the object currently in the scene. However, due to the processing complexity of object detection and the limited processing power and power supply of portable user devices, it is often not possible to run object detection in every frame.
Conventionally, this problem has been solved by not running object detection on every frame, but periodically running object detection, and instead tracking the detected object between successive object detection runs. However, object tracking is often less accurate than object detection, and objects may be easily lost. Furthermore, object tracking cannot handle occlusion of the tracked object or detect new objects entering the scene. Furthermore, if a scene is static, for example, periodic running object detection is not computationally efficient because object tracking can easily handle such static scenes. Another problem with periodically running object detection is that if new objects enter the scene between scheduled object detection runs, these objects will not be visualized in time.
Thus, there is a need for more efficient object position determination suitable for implementation in portable user devices.
Disclosure of Invention
It is a general object to provide an object position determination suitable for implementation in a portable user device.
This and other objects are achieved by aspects of the invention and embodiments disclosed herein.
One aspect of the invention relates to a method of object localization. The method comprises deciding, for at least one frame of the video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
Another aspect of the invention relates to an object locator comprising processing circuitry and memory, the memory comprising instructions executable by the processing circuitry. The processing circuitry is operative to determine, for at least one frame of the video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
Another aspect of the invention relates to a user equipment comprising an object locator according to the above and a camera configured to record video and generate a video stream.
Yet another aspect of the invention relates to a computer program comprising instructions which, when executed by at least one processing circuit, cause the at least one processing circuit to decide, for at least one frame of a video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether a determination of a position of at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of a position of the at least one object detected in the reference frame.
Another aspect of the invention relates to a computer program product having stored thereon a computer program comprising instructions that, when executed on a processing circuit, cause the processing circuit to decide, for at least one frame of a video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether a determination of a position of at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of a position of the at least one object detected in the reference frame.
The present invention provides a multi-mode technique for determining the position of an object in frames of a video stream. The multi-mode technique supplements an object detection mode with a transformation mode, in which the position of an object in a reference frame is transformed or projected to a position in a current frame. According to the present invention, the transformation mode reduces the computational complexity of determining the position of an object in a frame, thereby enabling implementation in portable user equipment with limited computational and power resources. The multi-mode technique also enables visualization of perceptual information for objects that are fully or partially occluded.
Drawings
The embodiments, together with further objects and advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 is an overview of a client-server architecture with object detection in a server;
FIG. 2 is an overview of a user device including an object detector;
fig. 3 schematically shows a scene of a frame of a video stream comprising an object whose position is to be determined;
FIG. 4 schematically illustrates the scenario of FIG. 3 enhanced with bounding boxes determined in object detection;
Fig. 5 schematically shows a scene of a subsequent frame of the video stream enhanced after camera rotation and with a bounding box determined in object detection;
FIG. 6 schematically illustrates the scenario of FIG. 5 enhanced with a bounding box determined based on a transformation operation;
Fig. 7A and 7B schematically illustrate the scene of a subsequent frame of the video stream enhanced after partial occlusion of an object and with a bounding box determined based on a transform operation (fig. 7A) or based on object detection (fig. 7B);
FIG. 8 is a flow chart illustrating an object positioning method according to an embodiment;
FIG. 9 is a flow chart illustrating additional optional steps of the method shown in FIG. 8, according to an embodiment;
FIG. 10 is a flow chart illustrating additional optional steps of the method shown in FIG. 8 according to another embodiment;
FIG. 11 is a flow chart illustrating an object positioning method in accordance with various embodiments;
FIG. 12 is a flow chart illustrating additional optional steps of the method shown in FIG. 8, according to an embodiment;
FIG. 13 is a flow chart illustrating additional optional steps of the method shown in FIG. 8, according to an embodiment;
FIG. 14 is a flow chart illustrating additional optional steps of the method shown in FIG. 8, according to an embodiment;
FIG. 15 is a block diagram of an object locator according to an embodiment;
FIG. 16 is a block diagram of an object locator according to another embodiment;
FIG. 17 is a block diagram of an object locator according to another embodiment;
FIG. 18 schematically illustrates a computer program based implementation of an embodiment;
FIG. 19 is a block diagram of an object locator according to another embodiment;
FIG. 20 schematically illustrates a distributed implementation between network devices;
fig. 21 is a schematic diagram of an example of a wireless communication system having one or more cloud-based network devices, according to an embodiment;
FIG. 22 is a schematic diagram illustrating an example of a telecommunications network connected to a host computer via an intermediate network in accordance with some embodiments, and
Fig. 23 is a schematic diagram illustrating an example of a host computer communicating with user equipment via a base station over a partial wireless connection, in accordance with some embodiments.
Detailed Description
Throughout the drawings, the same reference numerals are used for similar or corresponding elements.
Generally, all terms used herein are to be interpreted according to their ordinary meaning in the relevant art, unless explicitly given and/or implied by the use of the term in the context. All references to an/the element, device, component, means, step, etc. are to be interpreted openly as referring to at least one instance of the element, device, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless it has to be explicitly described as being followed or preceded by another step and/or implicitly as being followed or preceded by another step. Any feature of any embodiment disclosed herein may be applied to any other embodiment, where appropriate. Likewise, any advantages of any embodiment may apply to any other embodiment and vice versa. Other objects, features and advantages of the attached embodiments will be apparent from the description that follows.
The present invention relates generally to a method, object locator, computer program product and user equipment for object position determination in frames of a video stream.
A user equipment-server architecture for Augmented Reality (AR) is shown in fig. 1. The user device 1, represented by the portable wireless user device 1 in fig. 1, comprises or is connected to a camera 2 for capturing video and recording a video stream. Pictures or video frames (for simplicity, referred to herein as frames) of the video stream are then sent from the user device 1 to an Object Detection (OD) server 5. The frame transmission may involve sending the video stream, i.e. substantially all frames of the video stream, to the OD server 5. In an alternative embodiment, individual, typically time-stamped frames are sent to the OD server 5 for object detection.
The OD server 5 comprises an object detector 4 for performing object detection on the received frame or at least a part thereof. The object detection involves detecting objects in the processed frames and determining information of the detected objects, including object position representations, detection probabilities, and object types. The object position representation, commonly referred to in the art as a bounding box, defines a region of the processed frame or a region within the processed frame. The detection probability represents a likelihood that a region of the frame or a region within the frame defined by the object position representation includes the object. The object type defines the type or class of the detected object, e.g. car, pedestrian, house, etc.
This so-called detection information (i.e. object position representation, detection probability and object type) is returned to the user equipment 1 together with an indication of which frame the performed object detection is for (e.g. in terms of the time stamp of the relevant frame). The user device 1 then uses the detection information to enhance the video presented on the screen.
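As a concrete illustration of the detection information exchanged between the OD server 5 and the user device 1, the following minimal sketch shows one possible message layout. The field names and values are hypothetical and chosen only to mirror the object position representation, detection probability, object type and frame indication described above; they are not prescribed by the embodiments.

```python
# Hypothetical detection-information record for one processed frame.
detection_info = {
    "frame_timestamp": 1528374042.133,          # identifies which frame was processed
    "detections": [
        {
            "bounding_box": (120, 64, 48, 48),  # object position representation: x, y, width, height
            "probability": 0.93,                # detection probability
            "object_type": "connector",         # detected object type/class
        },
    ],
}
```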
The OD server 5 has access to an offline-trained convolutional neural network (CNN) based object detector and a modern Graphics Processing Unit (GPU) with a large amount of memory. Such CNNs typically include tens of millions of parameters trained offline on a large annotated dataset, such as PASCAL VOC (Everingham et al., "The PASCAL Visual Object Classes (VOC) challenge", International Journal of Computer Vision (2010) 88: 303-338) or ImageNet (Deng et al., "ImageNet: A large-scale hierarchical image database", in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)).
Examples of CNN-based object detectors 4 include Faster R-CNN (Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks", IEEE Transactions on Pattern Analysis and Machine Intelligence (2017) 39(6): 1137-1149), SSD (Liu et al., "SSD: Single shot multibox detector", Proceedings of the European Conference on Computer Vision (ECCV) (2016)) and YOLO9000 (Redmon and Farhadi, "YOLO9000: Better, faster, stronger", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)).
In another implementation example, as shown in fig. 2, the object detector 4 is implemented in the user equipment 1. Thus, the user device 1 then comprises both a camera 2 for capturing video and generating a video stream and an object detector 4 for processing frames of the video stream for the purpose of detecting objects therein. Thus, this implementation relaxes the need to send frames to and receive detection information from the remote OD server 5, but instead it is desirable to implement an object detector 4 that includes an offline trained object detection model used by the object detector 4 at the user device 1.
Augmented reality finds more and more applications in portable user devices 1. A typical example is an industrial AR application, where AR constitutes a support tool for technicians repairing complex hardware systems. In such a scenario, the object detection should typically run in real time on the portable user device 1, which places a limit on the complexity of the object detector 4. In most cases, the portability of the user device 1, and thus of the camera 2, results in a large amount of movement while the technician is engaged in repairing the hardware system. However, also in these cases, objects in the video should still be accurately detected and visualized. If the camera 2 changes its position relative to the photographed hardware system, object detection should generally be run in every frame of the video stream in order to detect and classify objects in the current scene. However, this is generally not possible due to the complexity of object detection and the battery limitations of the portable user device 1.
The present invention solves the above-mentioned drawbacks when implementing AR applications in a portable user equipment 1 by means of an adaptive switching between an object detection mode and a transformation mode (also called projection mode). This allows the AR application to run in real time in the portable user device 1 and enables visualization of the location of the object in real time.
Fig. 8 is a flowchart illustrating an object positioning method according to an embodiment. The method comprises a step S1, which is performed for at least one frame of the video stream, as schematically indicated by line L1. Step S1 comprises deciding, for at least one frame and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame.
Thus, according to the invention, the position of at least one object in a frame may be determined by applying object detection, i.e. a so-called object detection mode, or by transforming the position of the object, which has been determined in a previous frame of the video stream, i.e. a reference frame, to the position in the current frame, i.e. a so-called transformation mode. The decision or selection between the object detection mode and the transformation mode is based on at least one parameter representing a change in the scene from the reference frame to the current frame.
Object detection as used in the object detection mode is accurate, but is computationally intensive and power consuming. However, the computational complexity of the position transformation used in the transformation mode is relatively low. The invention thus enables computationally intensive object detection in many frames of a video stream to be replaced by a position transformation, thereby reducing the computational requirements and power consumption of implementing AR in the portable user device 1.
The reference frame is a previous frame of the video stream and more preferably a previous frame of the video stream to which object detection has been applied. Thus, an object applied to the reference frame detects at least one object in the reference frame and generates an object position representation for the at least one detected object, and typically also generates a detection probability and an object type for each detected object.
Fig. 3 shows an example of a hardware system in the form of a baseband switch 10 comprising a plurality of connectors 11, 13, 15, 17, 19 as an illustrative example of an object of a scene to be detected and enhanced by a portable user equipment. Fig. 4 shows the results of object detection applied to the scene in fig. 3, visualizing bounding boxes 21, 23, 25, 27, 29 around the detected objects 11, 13, 15, 17, 19. The bounding boxes 21, 23, 25, 27, 29 shown in fig. 4 are examples of object position representations generated in object detection applied to reference frames. Thus, the object position representation defines a corresponding region or bounding box in the reference frame, which region or bounding box comprises objects having a likelihood represented by a detection probability and having a type represented by an object type.
Thus, in an embodiment, the object position representation generated by object detection is a bounding box. Each bounding box represents four parameter values defining an area of the frame. In such an embodiment, step S1 of FIG. 8 includes deciding, based on at least one parameter, whether the determination of the bounding box defining the region in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the bounding box in the reference frame.
The bounding box may, for example, be in the form of a vector defining the coordinates (x, y) of the region and the size of the region. The coordinates (x, y) may be any coordinates that allow the location of the region in the frame to be identified. The coordinates may, for example, represent the center of the region or one of the corners of the region. As an illustrative but non-limiting example, the size of the region may be defined by the width w and the height h of the region. Thus, in an embodiment, the bounding box may be in the form of (x, y, w, h). In an alternative embodiment, the bounding box may include the coordinates of two opposite corners of the region, i.e., (x1, y1, x2, y2).
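A minimal sketch of such a bounding-box representation is given below, assuming the (x, y, w, h) form discussed above; the class name and the top-left-corner convention are assumptions for illustration, not part of the embodiments.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x: float  # x coordinate locating the region, here assumed to be the top-left corner
    y: float  # y coordinate locating the region
    w: float  # width of the region
    h: float  # height of the region

    def as_corners(self):
        """Alternative representation: coordinates of two opposite corners (x1, y1, x2, y2)."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)
```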
Object detection models and algorithms conventionally used for object detection in frames of video streams output a bounding box, a detection probability and an object type for each detected object, as previously mentioned herein. The bounding box is in most cases a rectangle or square defined by four parameters as described above. This may impose a limitation when detecting objects in the case where the imaged scene is rotated as shown in fig. 5. The bounding box may then not be aligned with the rotated objects detected in the scene. An advantage of the transformation mode is that the transformation applied to the bounding box of an object in the reference frame may involve various operations, such as rotation, rescaling, panning, etc. This means that even if the bounding box defines a rectangular area in the reference frame, the transformed bounding box in the current frame does not necessarily define a rectangular area when the transformation mode is applied. In sharp contrast, the transformed bounding box may define a quadrilateral region in the current frame. Illustrative but non-limiting examples of such quadrilateral regions include trapezoids, isosceles trapezoids, parallelograms, diamonds, rhomboids, kites, rectangles and squares. This is schematically shown in fig. 6, where the rectangular and square bounding boxes 21, 23, 25, 27, 29 from fig. 4 have been transformed into the diamond-shaped bounding boxes 21, 23, 25, 27, 29 shown in fig. 6.
Thus, in an embodiment, step S1 in FIG. 8 includes deciding, based on at least one parameter, whether to determine a bounding box defining a rectangular region in at least one frame based on object detection applied to the at least one frame or to determine a bounding box defining a quadrilateral region in the at least one frame based on a transformation of the bounding box in a reference frame.
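One way to realize such a transformation of a rectangular bounding box into a quadrilateral region is to project its four corners individually, e.g. with a 2x3 transformation matrix H as used later in this description. The sketch below is illustrative only; the function name and the corner ordering are assumptions.

```python
import numpy as np

def project_bbox_corners(corners, H):
    """Project the four corners (x1, y1, x2, y2) of a rectangular bounding box
    with a 2x3 transformation matrix H; the result is in general a quadrilateral."""
    x1, y1, x2, y2 = corners
    pts = np.array([[x1, y1, 1.0],
                    [x2, y1, 1.0],
                    [x2, y2, 1.0],
                    [x1, y2, 1.0]])   # corners in homogeneous coordinates
    return pts @ H.T                  # 4 x 2 array of projected corner positions
```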
The object detection used in the object detection mode may be according to any object detection algorithm implemented in the object detector. For example, object detection may be in the form of sliding window object detection, such as disclosed in Viola and Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, or Fischler and Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", Communications of the ACM (1981) 24(6): 381-395. Alternatively, the object detection may be in the form of CNN-based object detection, such as Faster R-CNN, SSD, or YOLO9000, as previously mentioned.
Object detection used in the object detection mode may be run by an object detector 4 implemented in the user equipment 1 as shown in fig. 2. In another embodiment, the object detection runs at a remote object detector 4 as implemented in an object detection server 5 as shown in fig. 1. In such an embodiment, the user equipment 1 sends frames that should undergo object detection to the remote object detector 4 and receives object detection information, i.e. an object position representation, therefrom and also typically receives detection probabilities and object types as previously disclosed herein. Remotely running object detection has the advantage of utilizing a powerful GPU (e.g., in the OD server 5) to run at an acceptable frame rate using computationally intensive object detection models and algorithms. The disadvantage is that frames need to be sent to the remote object detector 4 and detection information returned from it, which can extend the delivery of detection information to the user equipment 1 in case of network congestion and heavy network load.
Fig. 5 shows a rotation of the scene due to a movement of the portable user device 1 photographing the baseband switch 10, together with the result of object detection applied to the rotated scene. A significant limitation of object detection is that the bounding boxes determined during object detection are rectangular and thus less efficient at handling rotated scenes. In fig. 5, only one object 15 is correctly detected and visualized with the bounding box 25. Fig. 6 shows the same rotated scene as fig. 5, but here, instead of applying object detection in the object detection mode, the positions of the objects 11, 13, 15, 17, 19 are determined by transforming the positions of the objects 11, 13, 15, 17, 19 in the reference frame (see fig. 4) to positions in the current frame in fig. 6. As shown in fig. 6, the transformation mode can effectively handle scene rotations and still visualize bounding boxes 21, 23, 25, 27, 29 that have been transformed (e.g., rotated) relative to their positions in the reference frame.
Another significant advantage of using a transformation pattern is that this pattern can handle occlusion of objects 13, 15, 17, as shown in fig. 7A. In fig. 7A, another object 30 completely obscures the object with reference numeral 15 and partially obscures the objects 13, 17. However, by transforming the positions of these objects 13, 15, 17 from the reference frame in fig. 4 to the current frame, the bounding boxes 23, 25, 27 for these objects 13, 15, 17 can still be visualized correctly. This is not possible when using object detection mode or object tracking using prior art techniques, as shown in fig. 7B. Thus, in fig. 7B, only the objects 11, 19 are correctly detected, and only a part of the objects 13, 17 are detected. Since the object 15 is hidden by the object 30, the object 15 cannot be detected. The figure also shows bounding boxes 21, 23, 27, 29, 32 for detected objects 11, 13, 17, 19, 30.
Fig. 9 is a flow chart illustrating additional optional steps of the method of fig. 8, according to an embodiment. This embodiment comprises estimating or determining a transformation matrix based on reference keypoints derived from the reference frame and keypoints derived from at least one frame in step S10. The transformation matrix defines a transformation of a position in the reference frame to a position in the at least one frame. The method then proceeds to step S1 in fig. 8. In this embodiment step S1 comprises deciding, based on at least one parameter, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame using a transformation matrix.
The transformation matrix H defines a transformation of a position or location (x_ref, y_ref) in the reference frame to a position or location (x_curr, y_curr) in the current frame, i.e., [x_curr, y_curr]^T = H [x_ref, y_ref, 1]^T.
According to different embodiments, various types of transformation matrices may be estimated in step S10. In a typical example, the transformation matrix defines a geometric transformation of the positions between frames. A geometric transformation is a function whose domain and range are sets of points. Most commonly, the domain and range of a geometric transformation are both R² or both R³. Geometric transformations may be 1-1 functions, i.e. they have an inverse function. Illustrative but non-limiting examples of geometric transformations include affine transformations, which are functions between affine spaces that preserve points, lines and planes, and thereby parallelism; projective transformations, which are functions between projective spaces that preserve collinearity; and rotation-translation transformations.
In step S10, a transformation matrix is estimated based on keypoints derived from the reference frame and from the at least one frame. These keypoints are very distinctive points or features that can be identified, and preferably tracked, from frame to frame in the video stream. Thus, so-called reference keypoints are derived from a reference frame in the video stream. In particular embodiments, the reference keypoints are extracted from, or identified in, the reference frame. For example, the Shi-Tomasi algorithm (Shi and Tomasi, "Good Features to Track", in 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR'94, Seattle, WA, USA) may be used to identify reference keypoints in the reference frame.
Corresponding or matching keypoints are also derived from the at least one frame. In an embodiment, reference keypoints identified in the reference frame are tracked or followed in subsequent frames of the video stream until the current frame is reached. Tracking may be performed according to various keypoint, feature or object tracking algorithms, e.g. the Lucas-Kanade optical flow algorithm (Lucas and Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision", in Proceedings of the 7th International Joint Conference on Artificial Intelligence (1981) 2: 674-679, Vancouver, Canada). In another embodiment, the keypoint identification algorithm applied to the reference frame may be reapplied, but then applied to the current frame, in order to identify keypoints corresponding to the reference keypoints in the reference frame.
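For illustration, the following sketch uses the OpenCV implementations of the Shi-Tomasi detector and the Lucas-Kanade tracker mentioned above; the variable names and parameter values (number of corners, quality level, minimum distance) are assumptions, and ref_gray and cur_gray are assumed to be grayscale versions of the reference frame and the current frame.

```python
import cv2

# Shi-Tomasi reference keypoints in the (grayscale) reference frame
ref_pts = cv2.goodFeaturesToTrack(ref_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)

# Lucas-Kanade optical flow tracks those keypoints into the current frame
cur_pts, status, err = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, ref_pts, None)

# Keep only the keypoint pairs that were tracked successfully
ok = status.flatten() == 1
matched_ref = ref_pts[ok].reshape(-1, 2)
matched_cur = cur_pts[ok].reshape(-1, 2)
```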
Matching or corresponding keypoints as used herein refers to the same keypoints or features in the reference frame as in the current frame. For example, even though the location of the frame may have changed from the reference frame to the current frame, the upper left corner of the frame identified as the reference keypoint in the reference frame matches and corresponds to the upper left corner of the same frame in the current frame.
The transformation matrix may be estimated in step S10 based on reference keypoints derived from the reference frame and keypoints derived from at least one frame. Various matrix estimation methods may be used in step S10. For example, the elements of the transformation matrix may be estimated by means of Least Squares Estimation (LSE). As an example, assume that N matching keypoints are derived from the reference frame and the current frame:

P_ref = [x_ref,1 … x_ref,N; y_ref,1 … y_ref,N; 1 … 1] and P_curr = [x_curr,1 … x_curr,N; y_curr,1 … y_curr,N],

wherein (x_ref,i, y_ref,i) represents the x and y coordinates of a reference keypoint in the reference frame, (x_curr,i, y_curr,i) represents the x and y coordinates of the matching keypoint in the current frame, and i = 1, …, N. The estimation of the transformation matrix may then include finding the best transformation of the form:

P_curr ≈ H P_ref,

wherein the transformation matrix is H. The LSE solution giving the transformation matrix H is H = P_curr P_ref^T (P_ref P_ref^T)^(-1). Other algorithms and methods of estimating the transformation matrix from the two sets of keypoints (preferably two equally sized sets, i.e. the same number of keypoints in both sets) may be used. An illustrative but non-limiting example of such another algorithm or method is RANSAC (Fischler and Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", Communications of the ACM (1981) 24(6): 381-395).

The transformation matrix then transforms the positions (x_ref, y_ref) in the reference frame to positions (x_curr, y_curr) in the current frame:

[x_curr, y_curr]^T = H [x_ref, y_ref, 1]^T.
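A minimal numerical sketch of the LSE estimation described above is given below, assuming matched keypoints arranged as N x 2 arrays (as in the tracking sketch earlier); the function name is an assumption, and a RANSAC-based alternative such as OpenCV's estimateAffinePartial2D could be used instead, as noted in the text.

```python
import numpy as np

def estimate_transform_lse(ref_pts, cur_pts):
    """Least-squares estimate of a 2x3 transformation matrix H such that
    [x_curr, y_curr]^T ≈ H [x_ref, y_ref, 1]^T for all matched keypoints."""
    n = ref_pts.shape[0]
    ref_h = np.hstack([ref_pts, np.ones((n, 1))])           # N x 3 homogeneous reference points
    H_t, *_ = np.linalg.lstsq(ref_h, cur_pts, rcond=None)   # solves ref_h @ H_t ≈ cur_pts
    return H_t.T                                            # 2 x 3 transformation matrix

# Applying the estimated matrix to a single reference position:
# H = estimate_transform_lse(matched_ref, matched_cur)
# x_curr, y_curr = H @ np.array([x_ref, y_ref, 1.0])
```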
In an embodiment step S1 comprises deciding, based on at least one parameter derived from the transformation matrix, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame using the transformation matrix.
Thus, in this embodiment, at least one parameter derived from the transformation matrix is used as a parameter representing a change between the scene represented by the at least one frame and the scene represented by the reference frame.
In an embodiment, the transformation matrix is parameterized as:

H = [ s_x cos(θ)  -s_y sin(θ)  t_x ]
    [ s_x sin(θ)   s_y cos(θ)  t_y ]

wherein s_x = sqrt(h_11² + h_21²), s_y = sqrt(h_12² + h_22²), θ = atan2(h_21, h_11), t_x = h_13 and t_y = h_23, and h_ij denotes the element of H in row i and column j.

Here, s_x and s_y are the horizontal and vertical scaling factors, θ is the rotation angle, and t_x and t_y are the horizontal and vertical translations.
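These parameters can be recovered from an estimated 2x3 matrix with the decomposition formulas above; a minimal sketch, assuming the matrix layout given in the parameterization, is shown below (the function name is an assumption).

```python
import math

def decompose_transform(H):
    """Extract s_x, s_y, theta, t_x, t_y from a 2x3 matrix parameterized as
    [[s_x cos(theta), -s_y sin(theta), t_x], [s_x sin(theta), s_y cos(theta), t_y]]."""
    a, b, t_x = H[0]
    c, d, t_y = H[1]
    s_x = math.hypot(a, c)      # horizontal scaling factor
    s_y = math.hypot(b, d)      # vertical scaling factor
    theta = math.atan2(c, a)    # rotation angle in radians
    return s_x, s_y, theta, t_x, t_y
```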
Any of these parameters (i.e., scaling factor, rotation angle, and translation), or any combination of these parameters, may be used as a basis for determining whether the position of at least one object is based on object detection or transform-based decisions in step S1 of fig. 8.
Instead of, or in addition to, using at least one parameter derived from the transformation matrix, another parameter or other parameters representing a change between the scene represented by the at least one frame and the scene represented by the reference frame may be used as a basis for the decision in step S1 of fig. 8. For example, as shown in step S20 of fig. 10, at least one parameter may be received from at least one sensor of the user equipment. In this embodiment, the user equipment 1 comprises one or more sensors 2, 3, which may be used to generate or determine the at least one parameter. Illustrative but non-limiting examples of such sensors 2, 3 include accelerometers, magnetometers, gyroscopes and cameras, see fig. 1 and 2.
For example, an operating system of a wireless communication device (e.g., a smart phone) including Android and iOS provides an Application Programming Interface (API) to obtain an approximate rotation angle of the wireless communication device. These operating systems also provide access to raw data from various sensors (e.g., cameras, accelerometers, magnetometers, and gyroscopes) that can be used to estimate the position of the wireless communication device and, thus, the translation of the wireless communication device.
For example, the function getDefaultSensor(SENSOR_TYPE_ROTATION_VECTOR) reports the orientation of an Android-running wireless communication device relative to the north-east coordinate frame. It is typically obtained by integrating accelerometer, gyroscope and magnetometer readings together. For more information, see https://source.android.com/devices/sensors/sensor-types#rotation_vector. Accordingly, the class CMAttitude provides the orientation of wireless communication devices running iOS, see https://developer.apple.com.
Still another trend is that operating systems running in wireless communication devices, such as smartphones, include simultaneous localization and mapping (SLAM) functionality. SLAM includes algorithms that estimate position and orientation from the camera and other sensors in the wireless communication device. For example, Android supports the ARCore library (https://developers.google.com/ar/reference/java/com/google/ar/core/Camera#getDisplayOrientedPose()), and iOS supports the ARKit library (https://developer.apple.com/documentation/arkit/arcamera).
Thus, given the position and orientation of the user device as obtained from the at least one sensor, at least one parameter representing a change between the scene represented by the at least one frame and the scene represented by the reference frame may be calculated, for example by calculating a scene rotation, a scene translation and/or a scene scaling.
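As one illustration of deriving such a parameter from sensor data, the relative rotation between the device orientation reported for the reference frame and for the current frame can be computed. The sketch below assumes the orientations are available as unit quaternions (as provided by, e.g., rotation-vector sensors); the function name is an assumption.

```python
import numpy as np

def scene_rotation_angle(q_ref, q_cur):
    """Relative rotation angle (radians) between two unit quaternions describing
    the device orientation at the reference frame and at the current frame."""
    dot = abs(float(np.dot(q_ref, q_cur)))          # |<q_ref, q_cur>|, clipped for safety
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))
```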
Fig. 11 is a flow chart illustrating an object positioning method in accordance with various embodiments. In an embodiment, step S1 shown in fig. 8 includes steps S31, S33, S35, and S36 shown in fig. 11. In this embodiment step S31 comprises determining the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame if any rotation of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds a threshold rotation.
In an embodiment, the method further comprises an optional step S30 comprising comparing the rotation of the scene with a threshold rotation. If the rotation of the scene exceeds the threshold rotation, the method continues with step S31.
Thus, if the scene represented by the current frame has rotated more than a threshold rotation relative to the scene represented by the reference frame, the position of the at least one object in the current frame is determined according to a so-called transformation pattern (i.e. a transformation based on the position of the at least one object detected in the reference frame).
Fig. 6 illustrates rotation of the scene relative to the reference frame shown in fig. 4. As is clear from fig. 6, the transformation applied to the bounding box 21, 23, 25, 27, 29 around the object 11, 13, 15, 17, 19 in step S31 may rotate and optionally translate the bounding box 21, 23, 25, 27, 29 such that it appears around the object 11, 13, 15, 17, 19 even after the scene rotation. If instead an object detection mode is used as shown in fig. 5, such scene rotation cannot be handled effectively. As shown in fig. 6, the transformation mode can effectively handle scene rotations and still visualize bounding boxes 21, 23, 25, 27, 29 that have been transformed (e.g., rotated) relative to positions in the reference frame.
If the rotation of the scene does not exceed the threshold rotation, as verified in optional step S30, the position of the at least one object in the at least one frame is determined in step S33 based on the object detection applied to the at least one frame if any reduction of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds a threshold reduction.
In an embodiment, the method further comprises an optional step S32 comprising comparing the reduction of the scene with a threshold reduction. If the reduction of the scene exceeds the threshold reduction, the method continues with step S33.
Thus, if the scene represented by the current frame represents a scaled-down version of the scene represented by the reference frame, the location of the at least one object in the current frame is determined according to the object detection mode (i.e., based on object detection applied to the current frame).
In the case of a large or severe zoom-out (above a threshold zoom-out), the reason for using the object detection mode instead of the transformation mode is that, by zooming out, there is a high probability that new objects enter the scene, objects that were not present in the scene represented by the reference frame. Thus, for these new objects entering the scene, there are no corresponding objects in the scene represented by the reference frame. A typical example is a reference frame in which the camera has zoomed in on the left part of the baseband switch 10 in fig. 3. In this case, the only object appearing in the scene would be object 11. If the camera then zooms out to capture the entire baseband switch 10, the new objects 13, 15, 17, 19 appear in the scene. The position transformation between the scenes applied in the transformation mode is valid only for objects that appeared, and were detected, in the scene represented by the reference frame. This means that for a heavy zoom-out operation, which is likely to introduce new objects, object detection should instead be used in step S33 in order to determine the position of at least one object in the current frame.
If the shrinkage of the scene does not exceed the threshold shrinkage, as verified in optional step S32, if any translation of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds the threshold translation, in step S35 the position of the at least one object in the at least one frame is determined based on the object detection applied to the at least one frame.
In an embodiment, the method further comprises an optional step S34 comprising comparing the translation of the scene with a threshold translation. If the translation of the scene exceeds the threshold translation, the method continues with step S35.
The translation of the scene may be a translation of the scene in the x-direction, a translation of the scene in the y-direction, or any direction-independent translation. For example, assume that the upper left corner of the baseband switch 10 corresponds to pixel (2, 9) in the reference frame and to pixel (25, 17) in the current frame. In this case, the translation of the scene in the x-direction is 25 - 2 = 23 pixels, the translation of the scene in the y-direction is 17 - 9 = 8 pixels, and a direction-independent translation may be, for example, sqrt(23² + 8²) ≈ 24.4 pixels.
In case of large translations, the reason for using the object detection mode instead of the transformation mode is substantially the same as for the zoom-out, i.e. there is a risk that new objects enter the current frame, and wherein these objects are not present or detected in the reference frame.
If the translation of the scene does not exceed the threshold translation, as verified in optional step S34, then in step S36 the position of the at least one object in the at least one frame is determined based on the transformation of the position of the at least one object detected in the reference frame.
In other words, in an embodiment, the object detection mode is used only in case of heavy zoom out and pan, while the transformation mode is used in case of heavy rotation and in all other cases where the object detection mode is not used.
The order of comparison in optional steps S30, S32 and S34 may be changed, for example, in any of S30, S34 and S32, S30 and S34, S32, S34 and S30, S34, S30 and S32, or S34, S32 and S30.
Fig. 12 is a flow chart illustrating additional optional steps of the method shown in fig. 8, according to an embodiment. This embodiment includes comparing, in step S40, at least one parameter with a corresponding threshold value. The method then proceeds to step S1 in fig. 8. In this embodiment, step S1 comprises deciding, based on the comparison, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame.
As shown in fig. 11, step S40 may be performed according to any one of steps S30, S32, and S34.
In an embodiment, the at least one parameter comprises a rotation angle θ. In this embodiment, the method includes, if |θ| > T_θ, then in step S31 of fig. 11 the position of the at least one object in the at least one frame is determined based on the transformation of the position of the at least one object detected in the reference frame, otherwise in step S33 or S35 the position of the at least one object in the at least one frame is determined based on the object detection applied to the at least one frame. In this embodiment, T_θ is a threshold, such as the threshold rotation previously mentioned.
In an alternative or additional embodiment, the at least one parameter includes a horizontal scaling factor s_x and a vertical scaling factor s_y. In this embodiment, the method includes, if s_x < T_s or s_y < T_s, then in step S33 the position of the at least one object in the at least one frame is determined based on the object detection applied to the at least one frame, otherwise in step S36 the position of the at least one object in the at least one frame is determined based on a transformation of the position of the at least one object detected in the reference frame. In this embodiment, T_s is a threshold.
In this embodiment, a low value of the scaling factor represents a severe reduction.
In alternative or additional embodiments, the at least one parameter includes a horizontal translation t_x and a vertical translation t_y. In this embodiment, the method includes, if |t_x| > T_t or |t_y| > T_t, then in step S35 the position of the at least one object in the at least one frame is determined based on the object detection applied to the at least one frame, otherwise in step S36 the position of the at least one object in the at least one frame is determined based on a transformation of the position of the at least one object detected in the reference frame. In this embodiment, T_t is a threshold, such as the previously mentioned threshold translation.
Therefore, as described above, any of the above parameter examples may be used alone in deciding or selecting whether to use the object detection mode or the conversion mode. In another embodiment, at least two of the above parameters may be used to decide or select whether to use the object detection mode or the transformation mode, such as rotation and reduction, rotation and translation, or reduction and translation, or all three parameters may be used to decide or select whether to use the object detection mode or the transformation mode.
In this latter case, the at least one parameter includes a horizontal scaling factor s_x, a vertical scaling factor s_y, a rotation angle θ, a horizontal translation t_x and a vertical translation t_y. The method then includes, if |θ| > T_θ, determining, in step S31, the position of the at least one object in the at least one frame based on the transformation of the position of the at least one object detected in the reference frame. However, if |θ| ≤ T_θ, the method includes, if s_x < T_s or s_y < T_s, or |t_x| > T_t or |t_y| > T_t, determining, in step S33 or S35, the position of the at least one object in the at least one frame based on the object detection applied to the at least one frame, and otherwise determining, in step S36, the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame.
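The combined decision rule above can be summarized in a few lines of code; the following is a minimal sketch, with function and threshold names chosen for illustration (T_theta, T_s and T_t correspond to the rotation, scaling and translation thresholds).

```python
def decide_mode(theta, s_x, s_y, t_x, t_y, T_theta, T_s, T_t):
    """Select between the transformation mode and the object detection mode."""
    # Heavy rotation: axis-aligned object detection handles it poorly,
    # so transform the positions detected in the reference frame (step S31).
    if abs(theta) > T_theta:
        return "transform"
    # Heavy zoom-out or heavy translation: new objects may have entered the
    # scene, so re-run object detection (steps S33/S35).
    if s_x < T_s or s_y < T_s or abs(t_x) > T_t or abs(t_y) > T_t:
        return "detect"
    # Otherwise, transform the positions detected in the reference frame (step S36).
    return "transform"
```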
In an embodiment, the method comprises additional steps S50 and S51 as shown in fig. 13. Step S50 includes comparing a time parameter Δt, representing the time period from the reference frame to the at least one frame in the video stream, with a threshold value T_time. If Δt > T_time, the method continues with step S51. Step S51 includes determining the position of the at least one object in the at least one frame based on object detection applied to the at least one frame. However, if Δt ≤ T_time, the method continues with step S1 in fig. 8.
In another embodiment, the time parameter is used together with the scaling factors and the translations when deciding or selecting whether to use the object detection mode or the transformation mode. In this embodiment, the method includes, if |θ| > T_θ, determining, in step S31, the position of the at least one object in the at least one frame based on the transformation of the position of the at least one object detected in the reference frame. However, if |θ| ≤ T_θ, the method includes, if s_x < T_s or s_y < T_s, or |t_x| > T_t or |t_y| > T_t, or Δt > T_time, determining, in step S33 or S35, the position of the at least one object in the at least one frame based on the object detection applied to the at least one frame, and otherwise determining, in step S36, the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame.
Thus, this embodiment introduces an initial check or criterion before deciding whether the determination of the position of the at least one object is according to the object detection mode or according to the transformation mode. The initial check verifies that the reference frame is still the most recent, i.e. that a not too long time has elapsed from the reference frame in the video stream to the current frame in the video stream. For example, if the frame number of the reference frame in the video stream is 5 and the frame number of the current frame is 305, the reference frame may not be a good reference for any object in the current frame because the scene is likely to change significantly during these 300 frames. In this case, instead, object detection is preferably applied to the current frame in order to determine the position of any object.
The time parameter may represent time in seconds, e.g., Δt = 10 seconds. In another example, the time parameter indicates a number of frames, e.g., Δt = 300 frames. These examples are equivalent in that, given a frame number and the frame rate of the video stream, a difference in frames can be converted into a time, e.g., 300 frames represent 10 seconds of video at a frame rate of 30 fps. The frame rate may also be used to convert a time into a number of frames.
In an embodiment, if object detection is applied to at least one frame, for example, in step S33 or S35 in fig. 11, the at least one frame may be used as a reference frame for a subsequent frame or a subsequent frame of the video stream.
In an embodiment, the reference frame is the most recent frame in the video stream to which object detection has been applied to determine the position of the at least one object. Thus, if the current frame of the video stream has a frame number j, in this embodiment the reference frame is the frame to which object detection has been applied and which has a frame number j-k, where k is as low as possible. Although it is generally preferable to use the most recent frame to which object detection has been applied as a reference frame for any subsequent frame in the video stream, embodiments are not limited thereto. This means that instead of the most recent previous reference frame, another frame to which object detection has been applied may be used as the reference frame, i.e. a frame with frame number j-l is used instead of a frame with frame number j-k, where l > k, and frames with frame numbers j-k and j-l each contain at least one object detected using object detection, for example in step S35 or step S33 in fig. 11. This latter embodiment is preferably used in combination with the above-mentioned time parameter Δt; this time parameter is particularly useful to ensure that no outdated previous frame is used as a reference frame for the current frame.
Fig. 14 is a flow chart illustrating additional optional steps of the method shown in fig. 8. Step S2 comprises enhancing the at least one frame with perceptual information based on a position of the at least one object in the at least one frame. The enhanced at least one frame may then be output for display on a screen of the user device.
Thus, by using the position of any object determined according to embodiments based on the object detection mode or the transformation mode, at least one frame may be enhanced with perceptual information based on the position of the object.
In particular embodiments, the type of perceptual information used to enhance the at least one frame may be selected based on the type of object determined in the object detection.
Perceptual information, as used herein, relates to any information or data that may be used to enhance a scene. Non-limiting but illustrative examples of such sensory information include the name of a detected building, the name of a detected person, and the like.
Examples of the perception information may be bounding boxes 21, 23, 25, 27, 29 around the object as shown in fig. 3 to 7B. The visualized bounding box 21, 23, 25, 27, 29 may optionally be supplemented with information or identifiers of the objects 11, 13, 15, 17, 19 enclosed by the bounding box 21, 23, 25, 27, 29. The information or identifier may, for example, identify the name or type of connector in the baseband switch 10.
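As an illustration of this kind of enhancement, the sketch below draws a (possibly transformed, i.e. quadrilateral) bounding box and an object label onto a frame using OpenCV; the function name, colors and font settings are assumptions.

```python
import cv2
import numpy as np

def draw_box_with_label(frame, corners, label):
    """Draw a quadrilateral bounding box (4 corner points) and a text label."""
    pts = np.asarray(corners, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(frame, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
    org = (int(pts[0, 0, 0]), int(pts[0, 0, 1]) - 5)   # label just above the first corner
    cv2.putText(frame, label, org, cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```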
In a particular embodiment, the proposed method comprises four main steps:
A. Object detection is run in a reference frame to obtain bounding boxes of the objects of interest.
B. Projection: sparse point-to-point matching is used to find the transformation matrix H that projects the reference frame (e.g., the last frame subjected to the object detection step A) to the current frame. The bounding boxes detected in the reference frame are projected to the current frame using the transformation matrix.
C. Frame estimation: parameters are estimated from the transformation matrix or from the sensors, and it is decided when to switch between the object detection step A and the projection step B.
D. The detected or projected bounding box is visualized in the current frame and rendered on the screen.
In this particular embodiment, as an illustrative example, the object detection step may be performed by sliding window object detection or CNN-based object detection. The object detection takes frames as input and outputs a bounding box of the detected object. The projection step is based on an estimation of the transformation matrix. In this particular embodiment, the transformation matrix is obtained by first extracting very unique points (i.e., keypoints) from the reference frame and tracking them in subsequent frames. Next, a transformation matrix is estimated from the matched keypoints extracted from the reference frame and tracked in the current frame.
In an embodiment, the four main steps of the proposed algorithm may be implemented according to the following operations:
1. A reference frame is acquired from a video source (e.g., a camera of a user device) as a current frame.
2. Object detection is run on the current frame to find objects and bounding boxes.
3. Reference keypoints are extracted from the current frame, for example using the Shi-Tomasi algorithm.
4. The bounding boxes are drawn and the enhanced frame is rendered on the screen of the user device.
5. For each subsequent frame, the following is performed:
a. The current locations of the keypoints are found, for example using the Lucas-Kanade algorithm.
b. The transformation matrix is estimated from the reference keypoints of step 3 and their current positions, e.g. using LSE or RANSAC.
c. The parameters s_x, s_y, θ, t_x and t_y are computed from the transformation matrix and/or from the sensor data.
d. If θ > T_θ, go to step 5f, otherwise proceed to step 5e. Here, T_θ is a predefined angular threshold, for example a value of 5°.
e. If s_x < T_s or s_y < T_s (severe shrinkage), or |t_x| > T_t or |t_y| > T_t (severe shift), or the elapsed time T > T_time, the process goes to step 2, otherwise the process continues to step 5f. Here, T_s and T_t are thresholds, the values of which are preferably proportional to the size of the smallest detected object, and T_time is a predetermined interval, for example equal to 3 s.
f. The bounding boxes from step 2 are transformed, or projected, into the current frame using the transformation matrix.
g. The bounding boxes are drawn and the enhanced frame is rendered on the screen of the user device.
In another embodiment of this particular algorithm, the check between the elapsed time T and its threshold T_time is performed as a separate step between step 5c and step 5d.
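A minimal sketch of how the parameters of step 5c can be read out of an estimated transformation matrix is given below. It assumes the transformation is close to a similarity transform (negligible shear and perspective), which is one common simplification; the decomposition shown is an illustrative convention rather than the only possible derivation.

```python
import math

def transformation_parameters(H):
    """Decompose a 2x3 affine or 3x3 homography matrix into s_x, s_y, theta, t_x, t_y."""
    a, b, tx = H[0][0], H[0][1], H[0][2]
    c, d, ty = H[1][0], H[1][1], H[1][2]
    sx = math.hypot(a, c)                    # horizontal scaling factor s_x
    sy = math.hypot(b, d)                    # vertical scaling factor s_y
    theta = math.degrees(math.atan2(c, a))   # rotation angle in degrees
    return sx, sy, theta, tx, ty             # t_x, t_y: horizontal and vertical translation
```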
Another aspect of an embodiment relates to an object locator comprising processing circuitry and memory, the memory comprising instructions executable by the processing circuitry. The processing circuitry is operative to determine, for at least one frame of the video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
In an embodiment, the processing circuit is operative to estimate the transformation matrix based on reference keypoints derived from the reference frame and keypoints derived from at least one frame. The transformation matrix defines a transformation of a position in the reference frame to a position in the at least one frame. In this embodiment the processing circuitry is further operative to decide, based on the at least one parameter, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame using a transformation matrix.
In an embodiment the processing circuit is further operative to decide, based on at least one parameter derived from the transformation matrix, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame using the transformation matrix.
In an embodiment, the processing circuitry is operative to receive the at least one parameter from at least one sensor of the user equipment.
In an embodiment, the processing circuitry is operative to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame if any rotation of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds a threshold rotation, to determine the position of the at least one object in the at least one frame based on object detection applied to the at least one frame if any reduction of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds a threshold reduction or if any translation of the scene represented by the at least one frame relative to the scene represented by the reference frame exceeds a threshold translation, and otherwise to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame.
In an embodiment, the processing circuit is operative to compare the at least one parameter to a respective threshold. In this embodiment the processing circuitry is further operative to decide, based on the comparison, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the position of the at least one object detected in the reference frame.
In an embodiment, the at least one parameter includes a rotation angle θ. In this embodiment, the processing circuitry is operative, if θ > T_θ, to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame, and otherwise to determine the position of the at least one object in the at least one frame based on object detection applied to the at least one frame. T_θ is a threshold.
In an embodiment, the at least one parameter includes a horizontal scaling factor s_x and a vertical scaling factor s_y. In this embodiment, the processing circuitry is operative, if s_x < T_s or s_y < T_s, to determine the position of the at least one object in the at least one frame based on object detection applied to the at least one frame, and otherwise to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame. T_s is a threshold.
In an embodiment, the at least one parameter includes a horizontal translation t_x and a vertical translation t_y. In this embodiment, the processing circuitry is operative, if |t_x| > T_t or |t_y| > T_t, to determine the position of the at least one object in the at least one frame based on object detection applied to the at least one frame, and otherwise to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame. T_t is a threshold.
In an embodiment, the at least one parameter includes a horizontal scaling factor s_x, a vertical scaling factor s_y, a rotation angle θ, a horizontal translation t_x and a vertical translation t_y. In this embodiment, the processing circuitry is operative, if θ > T_θ, to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame. Otherwise, if s_x < T_s or s_y < T_s or |t_x| > T_t or |t_y| > T_t, the position of the at least one object in the at least one frame is determined based on object detection applied to the at least one frame, and otherwise the position of the at least one object in the at least one frame is determined based on a transformation of the position of the at least one object detected in the reference frame.
In an embodiment, the processing circuitry is operative to compare a time parameter T, representing a time period in the video stream from the reference frame to the at least one frame, with a threshold T_time. In this embodiment, the processing circuitry is operative, if T > T_time, to determine the position of the at least one object in the at least one frame based on object detection applied to the at least one frame, and otherwise to decide, based on the at least one parameter, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
In a further embodiment, the processing circuitry is operative, if θ > T_θ, to determine the position of the at least one object in the at least one frame based on a transformation of the position of the at least one object detected in the reference frame. Otherwise, if s_x < T_s or s_y < T_s or |t_x| > T_t or |t_y| > T_t or T > T_time, the position of the at least one object in the at least one frame is determined based on object detection applied to the at least one frame, and otherwise the position of the at least one object in the at least one frame is determined based on a transformation of the position of the at least one object detected in the reference frame.
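The sketch below summarizes this combined decision logic in code form, following the ordering of steps 5d and 5e above. The function and parameter names are hypothetical, and the default threshold values are illustrative assumptions (for example T_θ = 5° and T_time = 3 s); in practice T_t would be chosen in proportion to the size of the smallest detected object.

```python
def use_object_detection(theta, sx, sy, tx, ty, elapsed,
                         T_theta=5.0, T_s=0.8, T_t=50.0, T_time=3.0):
    """Return True to re-run object detection on the current frame,
    False to transform (project) the positions detected in the reference frame."""
    if abs(theta) > T_theta:
        # Large rotation: keep projecting the reference-frame positions,
        # since the detector is assumed less reliable for rotated objects.
        return False
    if sx < T_s or sy < T_s:            # severe shrinkage of the scene
        return True
    if abs(tx) > T_t or abs(ty) > T_t:  # severe shift of the scene
        return True
    if elapsed > T_time:                # reference frame has become too old
        return True
    return False
```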
In an embodiment the processing circuitry is operative to decide, based on the at least one parameter, whether the determination of the bounding box defining the region in the at least one frame is based on object detection applied to the at least one frame or on a transformation of the bounding box in the reference frame.
In an embodiment the processing circuitry is operative to decide, based on the at least one parameter, whether to determine a bounding box defining a rectangular area in the at least one frame based on object detection applied to the at least one frame or to determine a bounding box defining a quadrilateral area in the at least one frame based on a transformation of the bounding box in the reference frame.
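A minimal sketch of such a bounding box transformation is given below: the four corners of the rectangular box detected in the reference frame are projected with the transformation matrix, which in general yields a quadrilateral in the current frame. A 3×3 homography is assumed, and the function name is illustrative.

```python
import cv2
import numpy as np

def project_bounding_box(box, H):
    """Project a rectangular (x, y, w, h) box from the reference frame into the current frame."""
    x, y, w, h = box
    corners = np.float32([[x, y], [x + w, y], [x + w, y + h], [x, y + h]])
    projected = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), H)
    return projected.reshape(-1, 2)  # four corner points of the resulting quadrilateral
```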
In an embodiment, the processing circuitry is operative to enhance the at least one frame using perceptual information based on a position of the at least one object in the at least one frame.
Another aspect of the embodiments relates to an object locator configured to decide, for at least one frame of a video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether a determination of a position of at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of a position of at least one object detected in the reference frame.
It is to be understood that the methods, method steps, devices and device functions described herein may be implemented, combined and rearranged in various ways.
For example, embodiments may be implemented in hardware, or in software executed by suitable processing circuitry, or a combination thereof.
The steps, functions, procedures, modules and/or blocks described herein may be implemented in hardware using any conventional techniques, such as using discrete circuitry or integrated circuit technology, including both general purpose electronic circuitry and special purpose circuitry.
Alternatively, or in addition, at least some of the steps, functions, procedures, modules and/or blocks described herein may be implemented in software, such as a computer program executed by suitable processing circuitry (e.g., one or more processors or processing units).
Examples of processing circuitry include, but are not limited to, one or more microprocessors, one or more Digital Signal Processors (DSPs), one or more Central Processing Units (CPUs), video acceleration hardware, and/or any suitable programmable logic circuitry, such as one or more Field Programmable Gate Arrays (FPGAs) or one or more Programmable Logic Controllers (PLCs).
It should also be appreciated that the general processing capabilities of any conventional device or unit implementing the proposed technique can be re-used. Existing software may also be reused, for example, by reprogramming the existing software or by adding new software components.
Fig. 15 is a schematic block diagram illustrating an example of an object locator 100 according to an embodiment. In this particular example, object locator 100 includes processing circuitry 101 (e.g., a processor) and memory 102. Memory 102 includes instructions capable of being executed by processing circuitry 101.
Optionally, the object locator 100 may also include communication circuitry, represented by a corresponding input/output (I/O) unit 103 in FIG. 15. The I/O unit 103 may include functionality for wired and/or wireless communication with other devices, servers, and/or network nodes in a wired or wireless communication network. In a particular example, the I/O unit 103 may be based on radio circuitry for communicating (including transmitting and/or receiving information) with one or more other nodes. The I/O unit 103 may be interconnected to the processing circuit 101 and/or the memory 102. By way of example, I/O unit 103 may comprise any one of a receiver, a transmitter, a transceiver, I/O circuitry, an input port, and/or an output port.
Fig. 16 is a schematic block diagram illustrating an object locator 110 implemented based on hardware circuitry according to an embodiment. Specific examples of suitable hardware circuitry include one or more suitably configured or possibly reconfigurable electronic circuits, e.g., Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), or any other hardware logic, such as circuitry based on interconnected discrete logic gates and/or flip-flops to perform specialized functions in conjunction with suitable Registers (REGs) and/or memory cells (MEM).
Fig. 17 is a schematic block diagram illustrating yet another example of an object locator based on a combination of processing circuit(s) 122, 123 and hardware circuits 124, 125 and in combination with a suitable memory unit 121. Thus, the overall functionality is divided between programming software for execution on one or more processing circuits 122, 123 and one or more pre-configured or possibly reconfigurable hardware circuits 124, 125. The actual hardware-software partitioning may be determined by the system designer based on a number of factors, including processing speed, implementation costs, and other requirements.
Fig. 18 is a schematic diagram illustrating an example of a computer-program-based implementation of an object locator 200 according to an embodiment. In this particular example, at least some of the steps, functions, procedures, modules, and/or blocks described herein are implemented with a computer program 240, where the computer program 240 is loaded into the memory 220 for execution by processing circuitry including one or more processing circuits 210. The processing circuit(s) 210 and memory 220 are interconnected with each other to enable normal software execution. An optional I/O unit 230 may also be interconnected to the processing circuit(s) 210 and/or the memory 220 to enable input and/or output of relevant data (e.g., frames and detection information).
The term "processing circuit" shall be construed in a generic sense as any circuit, system, or device capable of executing program code or computer program instructions to perform specific processing, determining, or computing tasks.
Accordingly, the processing circuitry 210 is configured to perform well-defined processing tasks such as those described herein when the computer program 240 is executed.
The processing circuitry need not be dedicated to performing only the steps, functions, procedures, and/or blocks described above, but may also perform other tasks.
In an embodiment, the computer program 240 comprises instructions that, when executed by the at least one processing circuit 210, cause the at least one processing circuit 210 to decide, for at least one frame of the video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
The proposed technology also provides a carrier 250 comprising the computer program 240, also referred to as computer program product. Carrier 250 is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electrical signal, a radio signal, a microwave signal, or a computer readable storage medium.
By way of example, the software or computer program 240 is stored on a computer readable storage medium (in particular, a non-volatile medium), such as the memory 220. The computer readable medium may include one or more removable or non-removable memory devices including, but not limited to, Read-Only Memory (ROM), Random-Access Memory (RAM), Compact Disc (CD), Digital Versatile Disc (DVD), Blu-ray disc, Universal Serial Bus (USB) memory, Hard Disk Drive (HDD) storage, flash memory, magnetic tape, or any other conventional memory device. The computer program 240 may thus be loaded into the operating memory 220 for execution by the processing circuitry 210.
The computer program product 250 has stored thereon a computer program 240 comprising instructions that, when executed on the processing circuitry 210, cause the processing circuitry to decide, for at least one frame of a video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
When performed by one or more processors, the flow charts described herein may be regarded as computer flow charts. A corresponding device may be defined as a group of functional modules, where each step performed by the processor corresponds to a functional module. In this case, the functional modules are implemented as a computer program running on the processor.
Accordingly, the computer program residing in memory may be organized as suitable functional modules configured to perform, when executed by the processor, at least part of the steps and/or tasks described herein.
Fig. 19 is a block diagram of the object locator 130. The object locator 130 comprises a decision module 131 for deciding, for at least one frame of the video stream and based on at least one parameter representing a change between a scene represented by the at least one frame and a scene represented by a reference frame of the video stream, whether the determination of the position of the at least one object in the at least one frame is based on object detection applied to the at least one frame or based on a transformation of the position of the at least one object detected in the reference frame.
Another aspect relates to a user equipment 1, see fig. 1 and 2, comprising an object locator according to the invention, e.g. as described in connection with any of fig. 15 to 19. In an embodiment, the user device 1 further comprises or is connected to a camera 2 configured to record video and generate a video stream. In an embodiment, the user equipment 1 further comprises or is connected to at least one sensor 2,3 configured to generate at least one parameter.
In an embodiment, the user device is selected from the group consisting of a computer, a laptop computer, a smart phone, a mobile phone, a tablet computer, a multimedia player, a set top box, and a game console.
It is also becoming increasingly common to provide computing services (hardware and/or software) in network devices such as network nodes and/or servers, where resources are provided as services to remote locations over a network. By way of example, this means that the functionality as described herein may be distributed or relocated to one or more separate physical nodes or servers. The functionality may be relocated or distributed to one or more cooperating physical and/or virtual machines, which may be located at separate physical nodes, i.e. in a so-called cloud. This is sometimes referred to as cloud computing, which is a model that supports on-demand network access to a pool of configurable computing resources such as networks, servers, storage devices, applications, and general or custom services.
There are different forms of virtualization that are useful in this context, including one or more of the following:
Integrating network functions into virtualized software running on custom or generic hardware. This is sometimes referred to as network function virtualization.
Co-locating one or more application stacks (including operating systems) running on separate hardware on a single hardware platform. This is sometimes referred to as system virtualization or platform virtualization.
Co-location of hardware and/or software resources with the aim of achieving improved system resource utilization using some advanced domain-level scheduling and coordination techniques. This is sometimes referred to as resource virtualization, or centralized and coordinated resource pooling.
While it is often desirable to concentrate functionality into so-called general purpose data centers, it may be advantageous in other scenarios to actually distribute functionality over different parts of the network.
A network device may generally be considered an electronic device that is communicatively connected to other electronic devices in a network. As an example, the network device may be implemented in hardware, software, or a combination thereof. For example, the network device may be a private network device or a general-purpose network device or a mixture thereof.
The special purpose network device may execute software using custom processing circuitry and a proprietary Operating System (OS) to provide one or more of the features or functions disclosed herein.
The general-purpose network device may execute software using a common off-the-shelf (COTS) processor and a standard OS, which is configured to provide one or more of the features or functions disclosed herein.
As an example, a dedicated network device may include hardware including processing or computing resources, typically including a set of one or more processors, a physical Network Interface (NI), sometimes referred to as a physical port, and a non-transitory machine-readable storage medium having software stored thereon. The physical NI may be regarded as hardware in a network device for making network connections, e.g. by a Wireless Network Interface Controller (WNIC) wirelessly or by plugging a cable into a physical port connected to the Network Interface Controller (NIC). During operation, software may be executed by hardware to instantiate a set of one or more software instances. Each software instance and the portion of hardware executing the software instance may form a separate virtual network element.
As another example, a general-purpose network device may include, for example, hardware comprising a set of one or more processors (typically COTS processors) and a NIC, as well as a non-transitory machine-readable storage medium having software stored thereon. During operation, the processor(s) execute the software to instantiate one or more sets of one or more applications. While one embodiment does not implement virtualization, alternative embodiments may use different forms of virtualization, represented for example by a virtualization layer and software containers. For example, one such alternative embodiment implements operating-system-level virtualization, in which case the virtualization layer represents the kernel of an operating system (or a shim executing on a base operating system) that allows for the creation of multiple software containers, each of which may be used to execute one of a set of applications. In an example embodiment, each software container (also referred to as a virtualization engine, virtual private server, or jail) is a user-space instance (typically a virtual memory space). These user-space instances may be separate from each other and from the kernel space in which the operating system is executed; the set of applications running in a given user space cannot, unless explicitly allowed, access the memory of the other processes. Another such alternative embodiment implements full virtualization, in which case 1) the virtualization layer represents a hypervisor (sometimes referred to as a Virtual Machine Monitor (VMM)) or the hypervisor is executed on top of a host operating system, and 2) each software container represents a tightly isolated form of software container called a virtual machine, which is executed by the hypervisor and may include a guest operating system.
According to yet another embodiment, a hybrid network device is provided that includes both custom processing circuitry/proprietary OS and COTS processor/standard OS in the network device (e.g., a card or circuit board within the network device). In some embodiments of such hybrid network devices, a platform Virtual Machine (VM) (a VM that implements the functionality of a dedicated network device) may provide paravirtualization to hardware present in the hybrid network device.
Fig. 20 is a schematic diagram showing an example of how functions may be generally distributed or partitioned among different network devices. In this example, there are at least two separate but interconnected network devices 300, 310 that may divide different functions or portions of the same function between the network devices 300 and 310. There may be additional network devices 320 that are part of such a distributed implementation. The network devices 300, 310, 320 may be part of the same wireless or wired communication system, or one or more network devices may be so-called cloud-based network devices located outside the wireless or wired communication system.
As used herein, the term "network device" may refer to any device connected to a communication network, including but not limited to devices in an access network, a core network, and similar network structures. The term network device may also include cloud-based network devices.
Fig. 21 is a schematic diagram showing an example of a wireless communication system including a Radio Access Network (RAN) 41 and a core network 42 in cooperation with one or more cloud-based network devices 300 and an optional Operation Support System (OSS) 43. The figure also shows a user equipment 1 connected to a RAN 41 and capable of wireless communication with a RAN node 40, e.g. a network node, a base station, a Node B (NB), an evolved node B (eNB), a next generation node B (gNB), etc.
The network device 300, shown in fig. 21 as a cloud-based network device 300, may alternatively be implemented in connection with, for example, the RAN node 40 or at the RAN node 40.
In particular, the proposed techniques may be applied to specific applications and communication scenarios, including providing various services within a wireless network, including so-called Over The Top (OTT) services. For example, the proposed techniques enable and/or include transmission and/or reception of relevant user data and/or control data in wireless communications.
Hereinafter, a set of illustrative, non-limiting examples will now be described with reference to fig. 22 and 23.
Fig. 22 is a schematic diagram illustrating an example of a telecommunications network connected to a host computer via an intermediate network, in accordance with some embodiments.
Referring to fig. 22, according to an embodiment, a communication system includes a telecommunication network QQ410 (e.g., a 3GPP-type cellular network), the telecommunication network QQ410 including an access network QQ411 (e.g., a radio access network) and a core network QQ414. The access network QQ411 includes a plurality of base stations QQ412a, QQ412b, QQ412c (e.g., NBs, eNBs, gNBs or other types of wireless access points), each base station defining a corresponding coverage area QQ413a, QQ413b, QQ413c. Each base station QQ412a, QQ412b, QQ412c may be connected to the core network QQ414 by a wired or wireless connection QQ415. A first UE QQ491 located in coverage area QQ413c is configured to wirelessly connect to, or be paged by, the corresponding base station QQ412c. A second UE QQ492 in coverage area QQ413a is wirelessly connectable to the corresponding base station QQ412a. While multiple UEs QQ491, QQ492 are shown in this example, the disclosed embodiments are equally applicable to situations where a sole UE is in the coverage area or where a sole UE is connecting to the corresponding base station QQ412.
The telecommunications network QQ410 is itself connected to a host computer QQ430, which host computer QQ430 may be implemented in hardware and/or software of a stand-alone server, a cloud-implemented server, a distributed server, or as processing resources in a server cluster. The host computer QQ430 may be owned or controlled by a service provider, or may be operated by the service provider or on behalf of the service provider. The connections QQ421 and QQ422 between the telecommunications network QQ410 and the host computer QQ430 may extend directly from the core network QQ414 to the host computer QQ430, or may be made via an optional intermediate network QQ420. The intermediate network QQ420 may be one of, or a combination of more than one of, a public, private or hosted network; the intermediate network QQ420 (if present) may be a backbone network or the Internet; in particular, the intermediate network QQ420 may comprise two or more subnetworks (not shown).
The communication system of fig. 22 as a whole enables connectivity between the connected UEs QQ491, QQ492 and the host computer QQ430. This connectivity may be described as an Over The Top (OTT) connection QQ450. The host computer QQ430 and the connected UEs QQ491, QQ492 are configured to communicate data and/or signaling via the OTT connection QQ450, using the access network QQ411, the core network QQ414, any intermediate network QQ420 and possible further infrastructure (not shown) as intermediaries. The OTT connection QQ450 may be transparent in the sense that the participating communication devices through which the OTT connection QQ450 passes are unaware of the routing of uplink and downlink communications. For example, the base station QQ412 may not be informed, or need not be informed, about the past routing of an incoming downlink communication with data originating from the host computer QQ430 to be forwarded (e.g., handed over) to a connected UE QQ491. Similarly, the base station QQ412 need not be aware of the future routing of an outgoing uplink communication originating from the UE QQ491 towards the host computer QQ430.
Fig. 23 is a schematic diagram illustrating an example of a host computer communicating with a user equipment via a base station over a partially wireless connection, in accordance with some embodiments.
An example implementation of the UE, base station and host computer discussed in the previous paragraph according to an embodiment will now be described with reference to fig. 23. In communication system QQ500, host computer QQ510 includes hardware QQ515, and hardware QQ515 includes communication interface QQ516, with communication interface QQ516 configured to establish and maintain wired or wireless connections with interfaces of different communication devices of communication system QQ 500. The host computer QQ510 also includes processing circuitry QQ518, which may have storage and/or processing capabilities. In particular, the processing circuitry QQ518 may include one or more programmable processors adapted to execute instructions, application specific integrated circuits, field programmable gate arrays, or a combination thereof (not shown). The host computer QQ510 also includes software QQ511 that is stored in the host computer QQ510 or accessible to the host computer QQ510 and executable by the processing circuitry QQ 518. Software QQ511 includes host application QQ512. The host application QQ512 is operable to provide services to remote users (e.g., UE QQ 530), with the UE QQ530 connected via OTT connection QQ550 terminating at the UE QQ530 and host computer QQ 510. In providing services to remote users, host application QQ512 may provide user data that is sent using OTT connection QQ 550.
The communication system QQ500 further includes a base station QQ520 provided in the telecommunication system, the base station QQ520 including hardware QQ525 enabling it to communicate with the host computer QQ510 and with the UE QQ 530. The hardware QQ525 may include a communication interface QQ526 for establishing and maintaining wired or wireless connections with interfaces of different communication devices of the communication system QQ500, and a radio interface QQ527 for at least establishing and maintaining a wireless connection QQ570 with a UE QQ530 located in a coverage area (not shown in fig. 23) served by the base station QQ 520. The communication interface QQ526 can be configured to facilitate a connection QQ560 to the host computer QQ 510. The connection QQ560 may be direct or it may pass through a core network (not shown in fig. 23) of the telecommunication system and/or through one or more intermediate networks outside the telecommunication system. In the illustrated embodiment, the hardware QQ525 of the base station QQ520 further includes a processing circuit QQ528, and the processing circuit QQ528 may include one or more programmable processors adapted to execute instructions, application specific integrated circuits, field programmable gate arrays, or a combination thereof (not shown). The base station QQ520 also has software QQ521 stored internally or accessible via an external connection.
The communication system QQ500 also includes the already mentioned UE QQ530. The hardware QQ535 may include a radio interface QQ537 configured to establish and maintain a wireless connection QQ570 with a base station serving the coverage area in which the UE QQ530 is currently located. The hardware QQ535 of the UE QQ530 also includes processing circuitry QQ538, which may include one or more programmable processors adapted to execute instructions, application specific integrated circuits, field programmable gate arrays, or a combination thereof (not shown). UE QQ530 also includes software QQ531 that is stored in UE QQ530 or accessible to UE QQ530 and executable by processing circuitry QQ 538. Software QQ531 includes client application QQ532. The client application QQ532 is operable to provide services to human or non-human users via the UE QQ530 under the support of the host computer QQ 510. In host computer QQ510, executing host application QQ512 may communicate with executing client application QQ532 via OTT connection QQ550 terminating at UE QQ530 and host computer QQ 510. In providing services to users, the client application QQ532 may receive request data from the host application QQ512 and provide user data in response to the request data. OTT connection QQ550 may transmit both request data and user data. The client application QQ532 may interact with the user to generate user data that it provides.
Note that the host computer QQ510, the base station QQ520 and the UE QQ530 illustrated in fig. 23 may be similar or identical to the host computer QQ430, one of the base stations QQ412a, QQ412b, QQ412c and one of the UEs QQ491, QQ492 of fig. 22, respectively. That is to say, the inner workings of these entities may be as shown in fig. 23 and, independently, the surrounding network topology may be that of fig. 22.
In fig. 23, OTT connection QQ550 has been abstractly drawn to illustrate communications between host computer QQ510 and UE QQ530 via base station QQ520, without explicitly referring to any intermediate devices and the precise routing of messages via these devices. The network infrastructure may determine the route, which may be configured to be hidden from the UE QQ530 or from the service provider operating the host computer QQ510, or from both. The network infrastructure may also make its decision to dynamically change routes (e.g., based on load balancing considerations or reconfiguration of the network) while OTT connection QQ550 is active.
The wireless connection QQ570 between the UE QQ530 and the base station QQ520 is in accordance with the teachings of the embodiments described throughout this disclosure. One or more of the various embodiments improve the performance of OTT services provided to UE QQ530 using OTT connection QQ550, with wireless connection QQ570 forming the last segment in OTT connection QQ 550.
A measurement procedure may be provided for the purpose of monitoring the data rate, latency and other factors on which the one or more embodiments improve. There may further be optional network functionality for reconfiguring the OTT connection QQ550 between the host computer QQ510 and the UE QQ530 in response to variations in the measurement results. The measurement procedure and/or the network functionality for reconfiguring the OTT connection QQ550 may be implemented in the software QQ511 and hardware QQ515 of the host computer QQ510, or in the software QQ531 and hardware QQ535 of the UE QQ530, or in both. In an embodiment, sensors (not shown) may be deployed in or in association with the communication devices through which the OTT connection QQ550 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which the software QQ511, QQ531 may compute or estimate the monitored quantities. The reconfiguring of the OTT connection QQ550 may include message format, retransmission settings, preferred routing, etc.; the reconfiguring need not affect the base station QQ520, and it may be unknown or imperceptible to the base station QQ520. Such procedures and functionalities may be known and practiced in the art. In certain embodiments, measurements may involve proprietary UE signaling facilitating the host computer QQ510's measurements of throughput, propagation times, latency and the like. The measurements may be implemented in that the software QQ511 and QQ531 causes messages to be transmitted, in particular empty or 'dummy' messages, using the OTT connection QQ550 while it monitors propagation times, errors, etc.
The embodiments described above are to be understood as several illustrative examples of the invention. Those skilled in the art will appreciate that various modifications, combinations, and alterations can be made to the embodiments without departing from the scope of the invention. In particular, different partial solutions in different embodiments may be combined in other configurations where technically feasible. The scope of the invention is, however, defined by the appended claims.
Claims (31)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2018/083601 WO2020114585A1 (en) | 2018-12-05 | 2018-12-05 | Object location determination in frames of a video stream |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113168706A (en) | 2021-07-23 |
| CN113168706B true CN113168706B (en) | 2025-04-01 |
Family
ID=64607013
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201880099950.0A Active CN113168706B (en) | 2018-12-05 | 2018-12-05 | Object position determination in frames of a video stream |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12067771B2 (en) |
| EP (1) | EP3891700A1 (en) |
| CN (1) | CN113168706B (en) |
| WO (1) | WO2020114585A1 (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115049737B (en) * | 2021-02-26 | 2025-04-25 | 广东博智林机器人有限公司 | A posture labeling method, device, system and storage medium |
| US12236660B2 (en) * | 2021-07-30 | 2025-02-25 | Toyota Research Institute, Inc. | Monocular 2D semantic keypoint detection and tracking |
| CN114724316A (en) * | 2022-05-12 | 2022-07-08 | 中国银行股份有限公司 | Alarm method and device for automatic teller machine |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9665804B2 (en) * | 2014-11-12 | 2017-05-30 | Qualcomm Incorporated | Systems and methods for tracking an object |
| US9697608B1 (en) * | 2014-06-11 | 2017-07-04 | Amazon Technologies, Inc. | Approaches for scene-based object tracking |
| CN107851318A (en) * | 2015-08-18 | 2018-03-27 | 高通股份有限公司 | System and method for Object tracking |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6526156B1 (en) * | 1997-01-10 | 2003-02-25 | Xerox Corporation | Apparatus and method for identifying and tracking objects with view-based representations |
| FR2911707B1 (en) * | 2007-01-22 | 2009-07-10 | Total Immersion Sa | METHOD AND DEVICES FOR INCREASED REALITY USING REAL - TIME AUTOMATIC TRACKING OF TEXTURED, MARKER - FREE PLANAR GEOMETRIC OBJECTS IN A VIDEO STREAM. |
| KR101632963B1 (en) * | 2009-02-02 | 2016-06-23 | 아이사이트 모빌 테크놀로지 엘티디 | System and method for object recognition and tracking in a video stream |
| US9147260B2 (en) * | 2010-12-20 | 2015-09-29 | International Business Machines Corporation | Detection and tracking of moving objects |
| US20150178930A1 (en) * | 2013-12-20 | 2015-06-25 | Qualcomm Incorporated | Systems, methods, and apparatus for generating metadata relating to spatial regions of non-uniform size |
| US20170206430A1 (en) * | 2016-01-19 | 2017-07-20 | Pablo Abad | Method and system for object detection |
2018
- 2018-12-05 WO PCT/EP2018/083601 patent/WO2020114585A1/en not_active Ceased
- 2018-12-05 EP EP18814885.2A patent/EP3891700A1/en active Pending
- 2018-12-05 CN CN201880099950.0A patent/CN113168706B/en active Active
- 2018-12-05 US US17/296,687 patent/US12067771B2/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9697608B1 (en) * | 2014-06-11 | 2017-07-04 | Amazon Technologies, Inc. | Approaches for scene-based object tracking |
| US9665804B2 (en) * | 2014-11-12 | 2017-05-30 | Qualcomm Incorporated | Systems and methods for tracking an object |
| CN107851318A (en) * | 2015-08-18 | 2018-03-27 | 高通股份有限公司 | System and method for Object tracking |
Also Published As
| Publication number | Publication date |
|---|---|
| US12067771B2 (en) | 2024-08-20 |
| EP3891700A1 (en) | 2021-10-13 |
| US20220027623A1 (en) | 2022-01-27 |
| CN113168706A (en) | 2021-07-23 |
| WO2020114585A1 (en) | 2020-06-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10810430B2 (en) | Augmented reality with markerless, context-aware object tracking | |
| JP6789402B2 (en) | Method of determining the appearance of an object in an image, equipment, equipment and storage medium | |
| US10573018B2 (en) | Three dimensional scene reconstruction based on contextual analysis | |
| CN110869980B (en) | Distributing and rendering content as a spherical video and 3D portfolio | |
| JP2020119550A (en) | Graphical fiducial marker identification for augmented reality, virtual reality and robotics | |
| US10373380B2 (en) | 3-dimensional scene analysis for augmented reality operations | |
| JP7461478B2 (en) | Method and Related Apparatus for Occlusion Handling in Augmented Reality Applications Using Memory and Device Tracking - Patent application | |
| WO2019042419A1 (en) | Image tracking point acquisition method and device, and storage medium | |
| CN112712487B (en) | Scene video fusion method, system, electronic equipment and storage medium | |
| KR20130025944A (en) | Method, apparatus and computer program product for providing object tracking using template switching and feature adaptation | |
| CN113168706B (en) | Object position determination in frames of a video stream | |
| CN110111364B (en) | Motion detection method and device, electronic equipment and storage medium | |
| US10438405B2 (en) | Detection of planar surfaces for use in scene modeling of a captured scene | |
| CN113936085B (en) | 3D reconstruction method and device | |
| EP3815041A1 (en) | Object tracking in real-time applications | |
| CN111915713A (en) | A method for creating a three-dimensional dynamic scene, a computer device, and a storage medium | |
| KR20140043159A (en) | Line tracking with automatic model initialization by graph matching and cycle detection | |
| KR102561903B1 (en) | AI-based XR content service method using cloud server | |
| WO2024001847A1 (en) | 2d marker, and indoor positioning method and apparatus | |
| CN109657573A (en) | Image-recognizing method and device and electronic equipment | |
| WO2023284479A1 (en) | Plane estimation method and apparatus, electronic device, and storage medium | |
| US12190574B2 (en) | Object location determination | |
| Makita et al. | Photo-shoot localization of a mobile camera based on registered frame data of virtualized reality models | |
| WO2025097076A1 (en) | Modeling, drift detection and drift correction for visual inertial odometry | |
| CN117788526A (en) | A multi-target position tracking method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |