CN113360599A

CN113360599A - Multi-source heterogeneous information convergence cooperative processing platform based on content identification

Info

Publication number: CN113360599A
Application number: CN202110541644.1A
Authority: CN
Inventors: 付睿智; 田苗; 张建斌
Original assignee: Suzhou Haisai Artificial Intelligence Co., Ltd.
Current assignee: Fu Zhizhi
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2021-09-07

Abstract

The invention discloses a multi-source heterogeneous information convergence cooperative processing platform based on content identification. The platform comprises a basic environment layer, a data resource layer, a business processing layer and an application service layer. The basic environment layer includes a hardware support environment and a software support environment, and the hardware support environment comprises a distributed storage environment and a distributed computing environment. The data resource layer includes map data, business data, full-text retrieval data, unstructured data, business processing intermediate data and business processing result data, and provides a uniform data source and support for the business processing layer. The business processing layer comprises a multi-source data collection module, a preprocessing module and an automatic monitoring and warehousing module. The platform makes large volumes of data searchable, analyzable and explorable, supports multidimensional information queries by unit, type, time, hot spot, keyword and the like, and enables near-real-time full-text search of documents, which improves the efficiency of full-text data retrieval.

Description

Multi-source heterogeneous information convergence cooperative processing platform based on content identification

Technical Field

The invention relates to the technical field of information sharing and cooperation systems, in particular to a multi-source heterogeneous information convergence cooperative processing platform based on content identification.

Background

Internationally, as high and new technologies such as artificial intelligence, big data and cloud computing are integrated into the military fields of various countries, national defense science and technology information services are accelerating the digitization of traditional documents, the integration of heterogeneous data and the association of domain knowledge, and the form of future warfare is gradually shifting from informatization to intelligentization. The integrated development of intelligent technology and military intelligence has brought major changes to the strategy, organization, priorities and resource allocation of developed countries such as the United States. Intelligence work in the national defense field no longer relies on manual collection, processing and analysis as in the past; automation and intelligentization have become the inevitable trend of its development.

At present, military information systems have been under construction for many years. Infrastructure for transmitting and processing information data has been preliminarily established, and massive information data has accumulated. The data types include formatted, semi-formatted and unformatted data, and the carrier forms include text, data packets, pictures, videos, high-resolution images and the like. However, there is no unified platform that effectively integrates and processes data of these different types, formats and structures. Centralized storage, efficient query and associated application of intelligence data are important problems that urgently need to be solved. The following problems are prominent. First, the hardware environment is weak and cannot satisfy the growing demand for data. Second, the data is highly dispersed, and no associated application capability has been formed. Third, data standards are not unified, and preprocessing methods and techniques are lacking. Fourth, deep mining is insufficient, and the intelligence value of the massive data is not realized. Fifth, the shared service capability is weak, and the capability to provide diversified, on-demand support is insufficient.

Against this background, existing business systems have accumulated massive formatted, semi-formatted and unformatted information data in carrier forms including text, data packets, pictures, videos and high-resolution images, and data of these different types, formats and structures is currently not integrated and processed effectively on a unified platform. Traditional integration remains largely manual: most of it depends on manual identification and judgment by information processing personnel, including manual uploading and manual classification and warehousing, so the data scale and the timeliness, efficiency and accuracy of data processing urgently need to be improved. In addition, judging from how the accumulated information data is currently used, the highly dispersed data has no associated application capability, the data is insufficiently mined, and the intelligence value of the massive data is not realized. Information data in different formats is also difficult to classify and store automatically, and the data scale, quality and application level all remain to be improved, so intelligence personnel obtain effective data inefficiently.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides a multi-source heterogeneous information convergence cooperative processing platform based on content identification.

In order to achieve the above purpose, the invention adopts the following technical scheme: a multi-source heterogeneous information convergence cooperative processing platform based on content identification comprises a basic environment layer, a data resource layer, a business processing layer and an application service layer. The basic environment layer includes a hardware support environment and a software support environment, and the hardware support environment comprises a distributed storage environment and a distributed computing environment.

The data resource layer includes map data, business data, full-text retrieval data, unstructured data, business processing intermediate data and business processing result data. The data resource layer manages and stores the intermediate data and result data generated while the business processing layer processes information, and provides a uniform data source and support for the business processing layer.

The business processing layer comprises a plurality of base modules, which include a multi-source data collection module, a preprocessing module and an automatic monitoring and warehousing module. The application service layer provides full-text retrieval and classified display of intelligence data on the basis of business processing.

In a preferred embodiment of the present invention, the business processing layer further includes a backup module, a file moving module, a preprocessing module, a system core module and an extraction module.

In a preferred embodiment of the present invention, the software support environment comprises a MySQL database, the Elasticsearch search engine, a Java/Python development environment and the Docker application container engine.

In a preferred embodiment of the present invention, the platform comprises a server side and a Web client side which are connected through signals.

The server side comprises an access server, which is connected to a file storage server and to a database server, both of which are connected to an application server.

The application server is connected to a map server and to a Web server, and is also connected to a full-text retrieval server. The map server provides a map engine, map data and map network configuration.

The Web client is used for information display, browsing and auditing, and system management.

In a preferred embodiment of the invention, the access server provides a multi-source data access service, adapts to different data sources, and converts and extracts data information.

In a preferred embodiment of the present invention, the file storage server provides a distributed storage service for storing files and pictures.

In a preferred embodiment of the present invention, the database server manages core service data, and implements data backup and data recovery.

In a preferred embodiment of the present invention, the application server is configured to provide core service management and control services, configure service plug-ins and service modules, and provide interfaces.

The second technical scheme of the invention is a working method of the platform, which comprises the following steps:

step S1: the automatic monitoring and warehousing module monitors the message data source for newly input files, and the backup module provides disaster-recovery backup of the data so that original and finished data are not lost even in an extreme environment;

step S2: the file moving module copies or moves the new file to the corresponding working directory, the preprocessing module preprocesses the new file according to its file format, and the processed data is passed to the extraction module;

step S3: the system core module analyzes the data in the new file and warehouses the extracted specific information so that it can be conveniently called and displayed; at the same time, the file is moved to a file storage directory where the front end and the back end can access it.

In a preferred embodiment of the present invention, the preprocessing module in step S2 performs data modeling and knowledge generation, and constructs a knowledge base oriented to business, so as to form data processing rules.

The invention solves the defects in the background technology, and has the following beneficial effects:

(1) The invention makes large volumes of data searchable, analyzable and explorable. It supports multidimensional information queries by unit, type, time, hot spot, keyword and the like, supports both title and full-text retrieval, and performs content-based intelligent analysis on the unstructured data that has been warehoused, so that all data on the platform can be retrieved in full text and located quickly and accurately by conditions such as title, body text, originating unit and receiving time. Elasticsearch stores data as JSON documents. The inverted index it uses lists every unique word that appears in any document and identifies all documents that contain each word, so full-text search can be performed on the documents in near real time, which effectively improves the efficiency of full-text data retrieval.

(2) The invention preprocesses documents according to predefined document types and provides strong character-recognition preprocessing for picture documents: character recognition is first performed on the picture document, and the recognized document content is then processed automatically. This refines the system's overall document calibration process. At the document level, it ensures the uniformity and integrity of documents so that system processes are not invoked repeatedly, achieving a function similar to document formatting. At the data level, it ensures that the data extraction module can still work normally, extract important information and warehouse it without any change to important configuration.

(3) The invention supports automatic classification of heterogeneous unstructured information data such as text, pictures, video and voice. It comprehensively uses deep-learning-based picture character recognition, image recognition, speech recognition, natural language processing and a semi-supervised multi-modal deep learning classification algorithm to classify and grade information automatically, which improves the efficiency of reading information.

(4) The invention uses the Elasticsearch search engine as middleware to create indexes for the data that is stored in the database, which ensures efficient full-text retrieval. Elasticsearch is a distributed, highly scalable and highly real-time search and data analysis engine. It makes large volumes of data searchable, analyzable and explorable, supports multidimensional information queries by unit, type, time, hot spot, keyword and the like, and supports both title and full-text retrieval. Elasticsearch stores data as JSON documents. The inverted index it uses lists every unique word that appears in any document and identifies all documents that contain each word, so full-text search can be performed on the documents in near real time, which effectively improves the efficiency of full-text data retrieval.

(5) The invention uses data mining and big data analysis technologies to perform correlation analysis on the information, automatically extracts effective information data, automatically summarizes and analyzes it, and displays intelligence and situation information visually through a rich set of front-end charts, thereby achieving correlation analysis and visual analysis of the information. Knowledge-graph-based information extraction, big data analysis, data mining and other technologies are used together to perform the correlation analysis and visual analysis and to extract effective information such as entities, relations and events.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a general architecture diagram of a system in accordance with a preferred embodiment of the present invention;

FIG. 2 is a diagram of a system network topology structure in accordance with a preferred embodiment of the present invention;

FIG. 3 is a system component diagram of the preferred embodiment of the present invention;

FIG. 4 is a schematic diagram of the document identification operation of the preferred embodiment of the present invention;

FIG. 5 is a schematic diagram of a full text search process according to a preferred embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

As shown in FIG. 1, a multi-source heterogeneous information convergence cooperative processing platform based on content identification includes a basic environment layer, a data resource layer, a business processing layer and an application service layer. The basic environment layer includes a hardware support environment and a software support environment, and the hardware support environment comprises a distributed storage environment and a distributed computing environment.

In a preferred embodiment of the present invention, the business processing layer includes a plurality of base modules, which include a multi-source data collection module, a preprocessing module and an automatic monitoring and warehousing module. The multi-source data collection module accesses dynamic information data in real time and integrates the various intelligence products analyzed and researched in recent years with the information reported by troops at all levels. The preprocessing module performs extraction, cleaning, comparison, analysis and audit, and screens and intelligently analyzes the information data based on its content to form standardized data.

In a preferred embodiment of the invention, documents are preprocessed according to predefined document types, which refines the system's overall document calibration process. At the document level, this ensures the uniformity and integrity of documents so that system processes are not invoked repeatedly, achieving a function similar to document formatting. At the data level, it ensures that the data extraction module can still work normally, extract important information and warehouse it without any change to important configuration. In addition, the system provides strong character-recognition preprocessing for picture documents: character recognition is first performed on the picture document, and the recognized document content is then processed automatically.

In a preferred embodiment of the present invention, the business processing layer further includes a backup module, a file moving module, a system core module and an extraction module.

In a preferred embodiment of the present invention, the application service layer provides full-text retrieval and classified display of intelligence data on the basis of business processing. The automatic monitoring and warehousing module monitors and warehouses data automatically: it monitors unstructured data in a designated folder in real time and stores documents automatically in real time through steps such as 'content analysis' and 'automatic warehousing'. It also provides a classification algorithm that classifies message data in detail based on content analysis and displays it by category automatically. The platform supports automatic classification of heterogeneous unstructured information data such as text, pictures, video and voice, and comprehensively uses deep-learning-based picture character recognition, image recognition, speech recognition, natural language processing and a semi-supervised multi-modal deep learning classification algorithm to classify and grade the information automatically.

The software support environment includes a MySQL database, the Elasticsearch search engine, a Java/Python development environment and the Docker application container engine. The Elasticsearch search engine is used as middleware to create indexes for the data that is stored in the database, which ensures efficient full-text retrieval. Elasticsearch is a distributed, highly scalable and highly real-time search and data analysis engine.

In addition, the invention makes large volumes of data searchable, analyzable and explorable. It supports multidimensional information queries by unit, type, time, hot spot, keyword and the like, supports both title and full-text retrieval, and performs content-based intelligent analysis on the stored unstructured data, so that all data on the platform can be retrieved in full text and located quickly and accurately by conditions such as title, body text, originating unit and receiving time. Elasticsearch stores data as JSON documents. The inverted index it uses lists every unique word that appears in any document and identifies all documents that contain each word, so full-text search can be performed on the documents in near real time, which improves the efficiency of full-text data retrieval.
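The inverted-index idea described above can be illustrated with a minimal pure-Python sketch. This is not Elasticsearch's implementation; the tokenizer, the field names `title` and `body`, and the sample documents are assumptions made only for the illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (whitespace-delimited text only)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each unique word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Index both the title and the body so that either can be searched.
        for field in ("title", "body"):
            for word in tokenize(doc.get(field, "")):
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, stored as JSON-like dictionaries.
docs = {
    1: {"title": "Patrol report", "body": "aircraft patrol observed near the coast"},
    2: {"title": "Supply notice", "body": "fuel supply delayed at the coast depot"},
    3: {"title": "Patrol schedule", "body": "weekly patrol schedule for ground units"},
}
index = build_inverted_index(docs)
print(sorted(search(index, "patrol")))
print(sorted(search(index, "coast patrol")))
```

Because each word already points to the documents that contain it, a query only touches the entries for its own words instead of scanning every document.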

As shown in FIG. 2, in a preferred embodiment of the present invention, the network topology includes a server side and a Web client side which are connected through signals. The server side comprises an access server, a file storage server, a database server, an application server, a map server, a Web server and a full-text retrieval server. The access server provides a multi-source data access service, adapts to different data sources, and converts and extracts data information; it is connected to the file storage server and to the database server. The file storage server provides a distributed storage service for storing files and pictures. The database server manages core business data and implements data backup and data recovery. The file storage server and the database server are both connected to the application server. The application server provides core business management and control services, configures service plug-ins and business modules, and provides interfaces; it is connected to the map server, the Web server and the full-text retrieval server. The map server provides a map engine, map data and map network configuration. The Web client is used for information display, browsing and auditing, and system management.
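As a rough illustration, the connections listed above can be written down as an undirected graph and traversed. The node names are hypothetical labels, and the link between the Web server and the Web client is an assumption: the description only states that the server side and the Web client are connected.

```python
from collections import deque

# Links taken from the topology description; node names are illustrative labels.
LINKS = [
    ("access_server", "file_storage_server"),
    ("access_server", "database_server"),
    ("file_storage_server", "application_server"),
    ("database_server", "application_server"),
    ("application_server", "map_server"),
    ("application_server", "web_server"),
    ("application_server", "full_text_retrieval_server"),
    ("web_server", "web_client"),  # assumption: the client reaches the platform via the Web server
]


def build_graph(links):
    """Build an adjacency list for an undirected graph."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search for a path with the fewest hops, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


graph = build_graph(LINKS)
path = shortest_path(graph, "access_server", "web_client")
print(" -> ".join(path))
```

Under these assumptions, data entering at the access server reaches the Web client only through the application server, which matches its role as the central control point.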

In a preferred embodiment of the present invention, the OCR picture character recognition method includes the following steps:

Step S1: picture preprocessing, namely denoising the picture, and detecting and correcting pictures that need to be rotated.

Step S2: text positioning, namely locating the regions of the picture that contain characters and finding the bounding box of each word or text line.

Step S3: character recognition, namely recognizing the located characters; combining step S1 with step S2 yields end-to-end detection of the characters.

The method uses context-related information and recognizes Chinese text by means of a vocabulary. It takes a Chinese character sequence with deterministic boundaries as the processing unit, uses word co-occurrence probabilities obtained from statistics, and applies dynamic programming; a plurality of predefined processing sets are called across the processing units to process the target information.
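A vocabulary-plus-dynamic-programming approach of this kind can be sketched as follows. The sketch is a simplification: it scores each candidate word with an independent (unigram) probability rather than the co-occurrence probabilities mentioned above, and the vocabulary, the probabilities and the `unknown_prob` fallback are invented for the example.

```python
import math

# Hypothetical vocabulary with made-up word probabilities.
WORD_PROB = {
    "研究": 0.02,
    "研究生": 0.01,
    "生命": 0.02,
    "命": 0.001,
    "起源": 0.01,
}


def segment(text, word_prob, max_len=4, unknown_prob=1e-8):
    """Split text into the word sequence with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob:
                prob = word_prob[word]
            elif len(word) == 1:
                prob = unknown_prob  # fallback so that every position stays reachable
            else:
                continue
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("研究生命起源", WORD_PROB))
```

With these probabilities the segmentation 研究 / 生命 / 起源 scores higher than 研究生 / 命 / 起源, which shows how statistics resolve an ambiguous boundary that a greedy longest-match rule would get wrong.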

As shown in FIG. 5, in a preferred embodiment of the present invention, a full-text search method comprises the following steps:

step S1: the indexing process. Source data is collected from relational databases, the Internet and file systems and gathered in one place, and an index is created in an index database. Key information is extracted from the source data, and words are extracted from the key information, each word being associated with its source data. In other words, when the index is created, each word is linked to the source data and the association is recorded in the index database, so that finding the word finds the source data;

step S2: the search process. The user enters a query keyword to run a search; the keyword is looked up among the words in the index database, the associated source data is found through the index, and the search results are finally displayed.

This full-text retrieval method effectively reduces manual labor and improves processing efficiency. The invention also uses distributed real-time search, so that searches over TB-level data can return results within milliseconds and the query range is effectively narrowed; with the Chinese word segmentation plug-in for Elasticsearch, words are segmented accurately, which improves search efficiency.
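A multidimensional query of the kind described (keyword plus unit, type and time) could be expressed in the Elasticsearch Query DSL roughly as below. The sketch only builds the JSON request body; the field names `title`, `body`, `unit`, `doc_type` and `received_at` are hypothetical, since the patent does not give an index schema.

```python
import json


def build_query(keyword, unit=None, doc_type=None, received_from=None, received_to=None):
    """Build a bool query: full-text match on title/body plus optional exact filters."""
    filters = []
    if unit:
        filters.append({"term": {"unit": unit}})
    if doc_type:
        filters.append({"term": {"doc_type": doc_type}})
    time_range = {}
    if received_from:
        time_range["gte"] = received_from
    if received_to:
        time_range["lte"] = received_to
    if time_range:
        filters.append({"range": {"received_at": time_range}})
    return {
        "query": {
            "bool": {
                # Scored full-text part: the keyword may match the title or the body.
                "must": [{"multi_match": {"query": keyword, "fields": ["title", "body"]}}],
                # Unscored exact-match part: narrows the query range.
                "filter": filters,
            }
        }
    }


body = build_query("patrol", unit="unit-a", received_from="2021-01-01", received_to="2021-05-18")
print(json.dumps(body, indent=2))
```

Putting the unit, type and time conditions in `filter` rather than `must` keeps them out of relevance scoring, so they narrow the candidate set while the keyword match alone determines ranking.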

In a preferred embodiment of the invention, data mining and big data analysis technologies are used to perform correlation analysis on the information, extract message information and run statistical analysis for charts, which provides statistical chart displays for patrol analysis, cross-line scouting analysis and airplane patrol analysis. Effective information data is extracted, summarized and analyzed automatically, and intelligence and situation information is displayed visually through a rich set of front-end charts, achieving correlation analysis and visual analysis of the information. Knowledge-graph-based information extraction, big data analysis, data mining and other technologies are used together to perform the correlation analysis and visual analysis and to extract effective information such as entities, relations and events.

As shown in FIG. 3, in a preferred embodiment of the present invention, a CPU + GPU computing architecture is adopted, with a character recognition engine, distributed file storage, a relational database and the like as basic components. Based on technologies such as a distributed aggregation search engine, intelligent analysis of massive data and machine learning, the platform implements functions such as multi-source data acquisition, data conversion processing, data sorting, data reporting and data query. The core business scenarios and functions of the platform are implemented through a hierarchical structure of infrastructure management, data storage, application components and service interfaces, and the platform provides an application system environment that runs across operating systems and platforms.

In a preferred embodiment of the present invention, the comprehensive test results are compared with the third party commercial document management system as follows:

TABLE 1 comparison of the results of the comprehensive testing with the third-party commercial document management System

As shown in the table above, the invention comprehensively uses deep-learning-based picture character recognition, image recognition, speech recognition, natural language processing and a semi-supervised multi-modal deep learning classification algorithm to classify and grade heterogeneous unstructured information data such as text, pictures, video and voice automatically. The invention standardizes the methods used for processing and obtaining data in the business system and improves data scale, data processing efficiency, data quality and application level together, thereby improving the efficiency and capability with which intelligence personnel obtain effective data and meeting the current and future requirements of organizations.

In a preferred embodiment of the invention, data mining and big data analysis technologies are used to perform correlation analysis on the information; effective information data is extracted, collected and analyzed automatically, intelligence and situation information is displayed visually through a rich set of front-end charts, and the user is provided with accurate data services and intelligent decision support in an environment of data explosion.

The invention is designed for the massive formatted, semi-formatted and unformatted information data found in existing business systems. It adopts technologies such as content analysis, OCR picture character recognition, data mining, big data analysis and full-text retrieval, and addresses the main problems of highly dispersed storage of various information data, low retrieval efficiency, loosely associated applications and low support effectiveness. The system and method implement automatic grading, classification and warehousing of formatted, semi-formatted and unformatted information data, create and store indexes automatically, and support functions such as information query and retrieval in various modes, information summarization and correlation analysis, and visual display of statistical and analysis results, which effectively improves the level of information service support.

The invention can be integrated with existing business systems and makes full use of existing information data resources. It standardizes the methods used for processing and obtaining data, mines and releases the potential value of data resources, and improves data scale, data processing efficiency, data quality and application level together. This improves the efficiency and capability with which intelligence personnel obtain effective data, and gives the invention very high practical value and application prospects in the army.

As shown in FIG. 4, when the invention works, the automatic monitoring and warehousing module monitors the message data source for newly input files, and the backup module provides disaster-recovery backup of the data so that original and finished data are not lost even in an extreme environment. The file moving module copies or moves the new file to the corresponding working directory, and the preprocessing module preprocesses the new file according to its file format. The preprocessing module also performs data modeling and knowledge generation and constructs a business-oriented knowledge base to form data processing rules, and the processed data is passed to the extraction module. The system core module analyzes the data in the new file and warehouses the extracted specific information so that it can be conveniently called and displayed; at the same time, the file is moved to a file storage directory where the front end and the back end can access it.
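The workflow described above (monitor, back up, move, preprocess, extract, warehouse, store) can be sketched with the Python standard library. This is a minimal illustration under several assumptions: a single polling pass stands in for real-time monitoring, an in-memory SQLite table stands in for the MySQL database, only plain-text files are handled, and all directory, table and function names are hypothetical.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def preprocess(path):
    """Return the text of a file according to its format (only .txt in this sketch)."""
    if path.suffix.lower() == ".txt":
        return path.read_text(encoding="utf-8")
    # Picture documents would go through character recognition at this point.
    raise ValueError(f"unsupported format: {path.suffix}")


def extract(text):
    """Take the first line as the title and the remaining lines as the body."""
    lines = text.strip().splitlines()
    title = lines[0] if lines else ""
    return {"title": title, "body": "\n".join(lines[1:])}


def process_new_files(inbox, backup_dir, work_dir, storage_dir, db, seen):
    """Run one monitoring pass over the inbox and return the names that were processed."""
    processed = []
    for src in sorted(inbox.iterdir()):
        if not src.is_file() or src.name in seen:
            continue
        seen.add(src.name)
        shutil.copy2(src, backup_dir / src.name)                      # S1: back up the original
        work = Path(shutil.move(str(src), str(work_dir / src.name)))  # S2: move to working directory
        record = extract(preprocess(work))                            # S2: preprocess, then extract
        db.execute(
            "INSERT INTO documents (name, title, body) VALUES (?, ?, ?)",
            (work.name, record["title"], record["body"]),
        )                                                             # S3: warehouse extracted data
        shutil.move(str(work), str(storage_dir / work.name))          # S3: move to file storage
        processed.append(work.name)
    db.commit()
    return processed


# Demonstration in a temporary directory tree.
root = Path(tempfile.mkdtemp())
dirs = {name: root / name for name in ("inbox", "backup", "work", "storage")}
for d in dirs.values():
    d.mkdir()
(dirs["inbox"] / "report.txt").write_text(
    "Patrol report\nAircraft observed near the coast.", encoding="utf-8"
)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (name TEXT, title TEXT, body TEXT)")
seen = set()
processed = process_new_files(
    dirs["inbox"], dirs["backup"], dirs["work"], dirs["storage"], db, seen
)
rows = db.execute("SELECT name, title FROM documents").fetchall()
print(processed, rows)
```

Copying to the backup directory before anything else touches the file means a failure in a later step still leaves the original recoverable.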

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims

1. A multi-source heterogeneous information convergence cooperative processing platform based on content identification, comprising a basic environment layer, a data resource layer, a business processing layer and an application service layer, characterized in that,

the basic environment layer includes a hardware support environment and a software support environment, the hardware support environment comprising a distributed storage environment and a distributed computing environment;

the data resource layer includes map data, business data, full-text retrieval data, unstructured data, business processing intermediate data and business processing result data; the data resource layer is used for managing and storing the intermediate data and the result data generated while the business processing layer processes information, and provides a uniform data source and support for the business processing layer;

the business processing layer comprises a plurality of base modules, the plurality of base modules comprising a multi-source data collection module, a preprocessing module and an automatic monitoring and warehousing module;

the application service layer is used for providing full-text retrieval and intelligence data classification display on the basis of business processing.

2. The multi-source heterogeneous information convergence cooperative processing platform based on content identification as claimed in claim 1, wherein: the business processing layer further comprises a backup module, a file moving module, a preprocessing module, a system core module and an extraction module.

3. The multi-source heterogeneous information convergence cooperative processing platform based on content identification as claimed in claim 1, wherein: the software support environment includes a MySQL database, the Elasticsearch search engine, a Java/Python development environment and the Docker application container engine.

4. The network topology of the multisource heterogeneous intelligence convergence cooperative processing platform based on content identification as claimed in claim 1, comprising: a server side and a Web client side which are connected through signals,

5. The multi-source heterogeneous intelligence convergence collaborative processing platform based on content identification as claimed in claim 4, wherein: the access server provides multi-source data access service, adapts to different data sources, and converts and extracts data information.

6. The multi-source heterogeneous intelligence convergence collaborative processing platform based on content identification as claimed in claim 4, wherein: the file storage server provides distributed storage service for storing files and pictures.

7. The multi-source heterogeneous intelligence convergence collaborative processing platform based on content identification as claimed in claim 4, wherein: the database server manages the core service data, and realizes data backup and data recovery.

8. The multi-source heterogeneous intelligence convergence collaborative processing platform based on content identification as claimed in claim 4, wherein: the application server is used for providing core service management and control service, configuring service plug-in and service module and providing interface.

9. The working method of the multi-source heterogeneous intelligence convergence cooperative processing platform based on the content identification as claimed in claim 1, comprising the following steps:

10. The multi-source heterogeneous intelligence convergence collaborative processing platform based on content identification according to claim 9, wherein: in step S2, the preprocessing module performs data modeling and knowledge generation, constructs a business-oriented knowledge base and forms data processing rules.