CN107368501A - Data processing method and data processing device

Info

Publication number: CN107368501A
Application number: CN201610319458.2A
Authority: CN
Inventors: 高春光; 蒋佳涛; 鲁艳阳; 陈艺天
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2017-11-21
Anticipated expiration: 2036-05-13
Also published as: CN107368501B

Abstract

The present invention discloses a data processing method and a data processing device. The method includes: querying data and generating a data exchange table summarized over a predetermined time period; and processing the data exchange table using the Spark computing framework, including: reading the data exchange table and slicing the data by a data attribute; distributing the sliced data to the nodes of a server cluster; performing, at each node, the calculation related to the data attribute; and collecting the execution results of the nodes. Each node is configured with an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework. Because the method and device are based on the Spark computing framework combined with R or Python, data processing becomes more accurate and complete.

Description

Data processing method and data processing device

Technical field

The present disclosure relates generally to the field of computer technology, and in particular to a data processing method and a data processing device.

Background

In current big data processing, the data volume is large and the required calculations are sometimes complex, so a single processing technology can hardly meet the requirements. For example, large e-commerce enterprises generally use the concept of the key value item (KVI) to evaluate the value of commodities, and calculating KVI indices requires processing massive amounts of data. Such data either cannot be processed on a single machine or takes too long to process there, so the timeliness of the calculation cannot be guaranteed; a single processing technology, in turn, can hardly guarantee the completeness and accuracy of the calculation.

Therefore, a new method is needed for calculations over massive data, such as the calculation of KVI indices.

The above information disclosed in this background section is only intended to enhance understanding of the background of the disclosure, and it may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the invention

The present disclosure provides a data processing method and a data processing device that can process calculations over massive data quickly and accurately.

Other features and advantages of the disclosure will become apparent from the following detailed description, or will be learned in part through practice of the disclosure.

According to a first aspect of the disclosure, a data processing method includes: querying data and generating a data exchange table summarized over a predetermined time period; and processing the data exchange table using the Spark computing framework, including:

reading the data exchange table and slicing the data by a data attribute;

distributing the sliced data to the nodes of a server cluster;

performing, at each node, the calculation related to the data attribute; and collecting the execution results of the nodes;

wherein each node is configured with an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework.

According to an embodiment of the disclosure, querying the data includes: querying a data warehouse using the Hive query language HQL.

According to an embodiment of the disclosure, reading the data exchange table includes: reading the data exchange table using Spark SQL.

According to an embodiment of the disclosure, the sliced data is stored in the form of resilient distributed datasets.

According to an embodiment of the disclosure, the Spark computing framework and the R or Python computing module exchange data through a pipe.

According to an embodiment of the disclosure, the data attribute is the commodity category, the data exchange table includes a commodity table, an order table and a traffic table, and the calculation related to the data attribute performed by each node includes calculating a commodity pricing index. The commodity pricing index is calculated from the user visits, the average dwell time on the commodity page, the sales volume and the pull amount within a predetermined time period, each with its own weight.

According to an embodiment of the disclosure, the predetermined time period is 7 days or 30 days.

According to an embodiment of the disclosure, the commodities are sorted in descending order of the calculated commodity pricing index score; the top 20% of the commodities in the ranking are determined to be the most important commodities, those after the top 20% and up to 50% are key commodities, those after 50% and up to 80% are general commodities, and those after 80% are unimportant commodities.

According to a second aspect of the disclosure, a data processing device includes: a summarizing module for querying data and generating a data exchange table summarized over a predetermined time period; and a processing module for processing the data exchange table using the Spark computing framework, including:

performing, at each node, the calculation related to the data attribute;

collecting the execution results of the nodes;

an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework.

The data processing method and device of this embodiment are based on the Spark computing framework and are therefore able to process massive data quickly; combined with the R or Python computing module, they make data processing more accurate and complete. When calculating the commodity pricing index KVI (Key Value Item), the two factors of average commodity-page dwell time and pull amount are also taken into account, which makes the KVI more accurate and provides a basis for better managing commodity pricing strategy.

It should be understood that the above general description and the following detailed description are only exemplary and do not limit the disclosure.

Brief description of the drawings

The above and other objects, features and advantages of the disclosure will become more apparent from the detailed description of its example embodiments with reference to the accompanying drawings.

Fig. 1 shows a system architecture diagram according to an example embodiment of the disclosure.

Fig. 2 shows a flowchart of a data processing method according to an example embodiment of the disclosure.

Fig. 3 shows a flowchart of another data processing method according to an example embodiment of the disclosure.

Fig. 4 shows a block diagram of a data processing device according to an example embodiment of the disclosure.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are only schematic illustrations of the disclosure and are not necessarily drawn to scale. Identical reference numerals in the drawings denote identical or similar parts, and their repeated description is therefore omitted.

Furthermore, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will appreciate, however, that the technical solutions of the disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, steps and the like. In other cases, well-known structures, methods, implementations or operations are not shown or described in detail, so as not to distract from and obscure aspects of the disclosure.

Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

As shown in Fig. 1, the system architecture used by the invention is based on the Spark computing framework. The data warehouse is queried using the Hive query language HQL to produce data exchange table 100. Data exchange table 100 is read using Spark SQL; it may include record tables 101, 102 and 103, which record the individual calculation parameters.

Each summarized record table is processed and sliced as the calculation requires. The sliced data is stored in the form of resilient distributed datasets (RDDs). An RDD is an abstraction of distributed memory that provides a highly restricted shared-memory model: an RDD is a read-only, partitioned collection of records that can only be created by applying deterministic transformations to other RDDs. These restrictions, however, make the cost of fault tolerance very low. For a developer, an RDD can be regarded as a Spark object that itself lives in memory: reading a file yields an RDD, a computation on the file is an RDD, and the result set is also an RDD. The dependencies between different partitions, the data, and key-value map data can all be regarded as RDDs.

The sliced data blocks are distributed to the node servers of the server cluster for calculation, where each node server has an R or Python module preinstalled. If a part that cannot be implemented in Spark is encountered during the calculation, the preinstalled R or Python module can be used to compute it. When the calculation is complete, Spark collects the execution results for the data blocks of each node and aggregates them into one large result file 200; a Hive import statement is then called to import result file 200 into the Hive data warehouse, where the results are available for querying. Spark and R or Python exchange data through an operating-system pipe.
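
The pipe-based exchange described above can be illustrated outside Spark. The following is a minimal sketch, not the implementation of the disclosure: a parent process (standing in for a Spark worker) writes one record per line to a child Python process over an operating-system pipe and reads the results back. The child script, the "sku,value" record format and the function name run_through_pipe are assumptions made purely for illustration; within Spark, the RDD.pipe method offers a comparable mechanism.

```python
import subprocess
import sys

# Hypothetical external module: reads "sku,value" lines on stdin and
# writes "sku,value squared" lines on stdout. It stands in for an R or
# Python routine that the cluster framework itself does not provide.
CHILD_SCRIPT = r"""
import sys
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    sku, value = line.split(",")
    print(f"{sku},{float(value) ** 2}")
"""


def run_through_pipe(records):
    """Send records to the child process over a pipe and collect its output."""
    payload = "\n".join(f"{sku},{value}" for sku, value in records) + "\n"
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    results = []
    for line in completed.stdout.splitlines():
        sku, value = line.split(",")
        results.append((sku, float(value)))
    return results


if __name__ == "__main__":
    print(run_through_pipe([("sku-1", 2.0), ("sku-2", 3.0)]))
```

A line-oriented text format keeps the two sides independent of each other's language, which is why the same pattern would work with an R script as the child process.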

As shown in Fig. 2, the data processing method is based on the system architecture of Fig. 1 and includes steps S202 to S204:

In step S202, data is queried and a data exchange table summarized over a predetermined time period is generated.

The data to be processed is queried and collected, and the collected data is used to generate a data exchange table summarized over a predetermined time period. The predetermined time period can be set manually according to business needs.

In step S204, the data exchange table is processed using the Spark computing framework.

The data exchange table is read and the data is sliced according to a data attribute. The data attribute may be a parameter involved in the calculation. For example, calculating the key value item (KVI) index for an e-commerce enterprise's sales involves the commodity category, so the data attribute may be the commodity category, and the data is sliced by commodity category. Because the volume of data to be processed is large, slicing divides the massive data into many small data blocks.

Under the Spark computing framework, the sliced data blocks are distributed to the node servers of the server cluster.

Each node server performs the calculation related to the data attribute. For example, when calculating the KVI index for an e-commerce enterprise's sales, each node server classifies by commodity category and calculates the KVI index of each commodity from its sales volume and click volume.

After the node servers finish calculating, their execution results are collected and all calculation results are aggregated into one large result file, which is imported into the data warehouse. For example, a Hive import statement can be called to import the result data into a Hive data warehouse table for use.
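
The slice, distribute, compute and collect sequence of step S204 can be sketched in plain Python. This is an illustrative stand-in under stated assumptions rather than Spark code: a thread pool plays the role of the node servers, the row fields ("category", "sales", "clicks") are hypothetical, and the per-block calculation is a simple sum chosen only to keep the example short.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def slice_by_attribute(rows, attribute):
    """Group rows into blocks keyed by the value of one data attribute."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[attribute]].append(row)
    return dict(blocks)


def compute_block(item):
    """Per-node work: here, total sales and clicks for one block."""
    key, block = item
    return key, {
        "sales": sum(r["sales"] for r in block),
        "clicks": sum(r["clicks"] for r in block),
    }


def process(rows, attribute, workers=4):
    """Slice, distribute to workers, compute, and collect into one result."""
    blocks = slice_by_attribute(rows, attribute)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool stands in for the cluster nodes (an assumption).
        return dict(pool.map(compute_block, blocks.items()))


if __name__ == "__main__":
    rows = [
        {"category": "books", "sales": 3, "clicks": 40},
        {"category": "phones", "sales": 1, "clicks": 90},
        {"category": "books", "sales": 2, "clicks": 10},
    ]
    print(process(rows, "category"))
```

Because each block is keyed by the slicing attribute, the per-block results can be merged into a single result without any coordination between workers.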

In the above calculation, each node server may be configured with an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework. If an algorithm package is missing from Spark's scientific computing libraries during the calculation, the R or Python computing module can supply it so that the calculation can be completed.

The data processing method of this embodiment is based on the Spark computing framework and is therefore able to process massive data quickly; combined with the R or Python computing module, it makes data processing more accurate and complete.

According to an example embodiment, the data may be queried in the data warehouse using the Hive query language HQL.

According to an example embodiment, the Spark computing framework and the R or Python computing module exchange data through an operating-system pipe.

Large e-commerce enterprises sell a huge number of commodities and very much want to understand which commodities most influence users' impression of the store, so as to keep an advantage over competitors. The concept of the key value item (KVI) is used to evaluate the value of commodities. KVI commodities are price-sensitive commodities: a change in their price has a large effect on the sales volume of the commodity itself and of other related commodities. Whether a commodity is a KVI commodity can be measured along several dimensions, including page views and purchase volume. Considering these aspects of each commodity together shows which commodities attract the most browsing and are most readily bought; these KVI commodities influence users' impression of the store more than other commodities do. The existing approach evaluates the importance of commodities by dividing them into four classes, A, B, C and D, according to their sales volume and click volume, where A is the most important and D the least important. Sales volume and click volume are combined with certain weights into a value representing the importance of the commodity. The sales volume of a commodity is obtained by querying the database table that records sales, and the click volume can be obtained from records made by page code when users visit the page. Sales volume and click volume are then normalized to values between 0 and 1, and a composite value is calculated with certain weights:

K=w1*sales_quantity+w2*traffic,

where sales_quantity denotes the sales volume and traffic denotes the click volume. The prior art does not evaluate commodities comprehensively: it does not consider the pull effect or the page dwell time, two indicators that materially affect commodities. The sales volume of a commodity itself may be small, yet the commodity may pull the sales of other commodities.

Calculating the KVI index of commodities well requires the support of a series of data processing technologies. R is a very powerful language for data analysis and processing and is well suited to the data preprocessing and algorithm implementation of the KVI index calculation. Python also has strong scientific computing capabilities and rich scientific computing libraries, which support the accuracy and performance of the KVI index calculation. The Hadoop platform provides the underlying support for the data storage and computation that the KVI index calculation requires. Hive is a data warehouse platform based on Hadoop.
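
The prior-art composite value K can be sketched as follows. The text only states that sales volume and click volume are normalized to values between 0 and 1; min-max scaling is one common way to do that and is an assumption here, as are the function names and the equal default weights.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def baseline_scores(sales, traffic, w1=0.5, w2=0.5):
    """Compute K = w1 * sales_quantity + w2 * traffic on normalized inputs."""
    norm_sales = min_max_normalize(sales)
    norm_traffic = min_max_normalize(traffic)
    return [w1 * s + w2 * t for s, t in zip(norm_sales, norm_traffic)]


if __name__ == "__main__":
    # Three commodities: raw sales volumes and raw click volumes.
    print(baseline_scores([10, 20, 30], [300, 100, 200]))
```

Normalizing first keeps a factor with a large raw range, such as click volume, from dominating the weighted sum.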

As shown in Fig. 3, the above data processing method is used to calculate the commodity pricing index KVI (Key Value Item). The data attribute may be the commodity category SKU (Stock Keeping Unit), the data exchange table includes a commodity table, an order table and a traffic table, and each node server performs the calculation related to the SKU, including calculating the commodity pricing index KVI. The SKU is the unit in which stock movements are measured and may be the piece, box, pallet and so on. The SKU is a necessary method for logistics management in the distribution centers of large chain stores, and the term has since been extended to mean a unified product number: each product corresponds to a unique SKU number. The method includes steps S302 to S304:

In step S302, data is queried and a commodity table, an order table and a traffic table summarized over a predetermined time period are generated.

Data related to commodity sales is queried and collected; the Hive query language HQL can be used to collect and summarize each item of data in the data warehouse. The collected data is used to generate a data exchange table summarized over a predetermined time period, and the generated data exchange table includes a commodity table, an order table and a traffic table. The predetermined time period can be set manually according to business or analysis needs, for example 7 days (one week) or 30 days (one month). When the data exchange table is processed using the Spark computing framework, the data can be sliced according to the SKU in the commodity table, the sliced data blocks are distributed to the node servers of the server cluster, and each node server performs the related calculation.
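
Summarizing over the predetermined time period can be illustrated with a small sketch. The order record fields ("sku", "date", "quantity"), the function name summarize_orders and the choice of a window that includes the end date are all assumptions for illustration; in the described system this summarization would be done by an HQL query against the data warehouse.

```python
from collections import Counter
from datetime import date, timedelta


def summarize_orders(orders, end_date, period_days=7):
    """Sum order quantity per SKU over the period_days ending on end_date."""
    start_date = end_date - timedelta(days=period_days - 1)
    totals = Counter()
    for order in orders:
        if start_date <= order["date"] <= end_date:
            totals[order["sku"]] += order["quantity"]
    return dict(totals)


if __name__ == "__main__":
    orders = [
        {"sku": "A", "date": date(2016, 5, 13), "quantity": 2},
        {"sku": "A", "date": date(2016, 5, 7), "quantity": 1},
        {"sku": "B", "date": date(2016, 5, 6), "quantity": 5},
    ]
    print(summarize_orders(orders, date(2016, 5, 13), period_days=7))
    print(summarize_orders(orders, date(2016, 5, 13), period_days=30))
```

Passing the period as a parameter reflects the text's point that 7 days and 30 days are both valid settings chosen by the business.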

In step S304, the commodity table, order table and traffic table are processed using the Spark computing framework.

When calculating the commodity pricing index KVI (Key Value Item), the information in the commodity table, order table and traffic table is used to derive the user visits, the average dwell time on the commodity page, the sales volume, the pull amount and similar quantities. These factors are considered together, and each can be given a different weight; the weights can take many needs into account and are set according to business needs.
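
A weighted combination of the four factors might look like the sketch below. The text does not give the weights or a formula, so the weighted sum, the factor names and the example weights are all assumptions; each factor is assumed to have been scaled to the range 0 to 1 beforehand.

```python
# Hypothetical factor names for the four quantities named in the text.
FACTORS = ("visits", "dwell_time", "sales", "pull_amount")


def kvi_score(factors, weights):
    """Weighted sum of the four factors, each assumed already scaled to [0, 1]."""
    missing = [name for name in FACTORS if name not in factors or name not in weights]
    if missing:
        raise KeyError(f"missing factor or weight: {missing}")
    return sum(weights[name] * factors[name] for name in FACTORS)


if __name__ == "__main__":
    # Equal weights are purely illustrative; the text leaves them to the business.
    weights = {"visits": 0.25, "dwell_time": 0.25, "sales": 0.25, "pull_amount": 0.25}
    factors = {"visits": 1.0, "dwell_time": 0.5, "sales": 0.0, "pull_amount": 0.5}
    print(kvi_score(factors, weights))
```

Keeping the weights in a separate mapping lets them be changed per business need without touching the scoring code.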

After the node servers finish calculating, their execution results are collected and all calculation results are aggregated into one large result file, which is imported into the data warehouse. For example, a Hive import statement can be called to import the result data into a Hive data warehouse table, where it is available for use as the result.

In the above calculation, each node server may be configured with an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework. If an algorithm package is missing from Spark's scientific computing libraries, the R or Python computing module can supply it so that the calculation can be completed.

The data processing method of this embodiment is based on the Spark computing framework and is therefore able to process massive data quickly; combined with the R or Python computing module, it makes data processing more accurate and complete. When calculating the commodity pricing index KVI (Key Value Item), the two factors of average commodity-page dwell time and pull amount are also taken into account, which makes the KVI more accurate and provides a basis for better managing commodity pricing strategy.

According to an example embodiment, the calculated commodity pricing index KVI (Key Value Item) scores can be sorted in descending order. The top 20% of the commodities in the ranking can be identified as the most important commodities, those after the top 20% and up to 50% as key commodities, those after 50% and up to 80% as general commodities, and those after 80% as unimportant commodities. When formulating sales strategies, different strategies can be drawn up according to the importance of the commodities, with the emphasis on the most important and key commodities. The above division by importance is merely illustrative; when using the KVI, the ranges can be chosen freely according to actual conditions.
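
The ranking-based division can be sketched as follows. The thresholds come from the text; how a commodity sitting exactly on a boundary is assigned (here, to the higher tier) and the function name classify_by_rank are assumptions.

```python
def classify_by_rank(scores):
    """Map each SKU to a tier based on its position in the descending ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = len(ranked)
    tiers = {}
    for position, sku in enumerate(ranked):
        fraction = (position + 1) / total  # share of the ranking covered so far
        if fraction <= 0.20:
            tiers[sku] = "most important"
        elif fraction <= 0.50:
            tiers[sku] = "key"
        elif fraction <= 0.80:
            tiers[sku] = "general"
        else:
            tiers[sku] = "unimportant"
    return tiers


if __name__ == "__main__":
    # Ten hypothetical SKUs with strictly decreasing scores.
    example = {f"sku{i}": 10 - i for i in range(10)}
    for sku, tier in classify_by_rank(example).items():
        print(sku, tier)
```

Since the text notes that the ranges are merely illustrative, the three thresholds could equally be passed in as parameters.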

As shown in Fig. 4, a data processing device includes:

A summarizing module 402, for querying data and generating a data exchange table summarized over a predetermined time period.

A processing module 404, for processing the data exchange table using the Spark computing framework.

Under the Spark computing framework, the sliced data blocks are distributed to the node servers of the server cluster. Each node server performs the calculation related to the data attribute. For example, when calculating the key value item (KVI) index for an e-commerce enterprise's sales, each node server calculates the KVI index from sales volume and click volume.

An R or Python computing module 406, for handling the parts that cannot be computed within the Spark computing framework.

The data processing device of this embodiment is based on the Spark computing framework and is therefore able to process massive data quickly; combined with the R or Python computing module, it makes data processing more accurate and complete.

In a large e-commerce enterprise, the above data processing device is used to calculate the commodity pricing index KVI (Key Value Item). The data attribute may be the commodity category SKU (Stock Keeping Unit), the data exchange table includes a commodity table, an order table and a traffic table, and each node server calculates the commodity pricing index KVI. Each module performs the following functions:

The summarizing module 402 queries and collects data related to commodity sales and uses the collected data to generate a data exchange table summarized over a predetermined time period. The predetermined time period can be set manually according to business or analysis needs, for example 7 days (one week) or 30 days (one month).

The Hive query language HQL can be used to collect and summarize each item of data in the data warehouse and to generate the data exchange table, which includes a commodity table, an order table and a traffic table.

The processing module 404 processes the data exchange table using the Spark computing framework: the sliced data blocks are distributed to the node servers of the server cluster, and each node server calculates the commodity pricing index KVI (Key Value Item).

When calculating the commodity pricing index KVI (Key Value Item), factors such as the user visits, the average dwell time on the commodity page, the sales volume and the pull amount are considered together. Each factor can be given a different weight; the weights can take many needs into account and are set according to business needs.

The data processing device of this embodiment is based on the Spark computing framework and is therefore able to process massive data quickly; combined with the R or Python computing module, it makes data processing more accurate and complete. When calculating the commodity pricing index KVI (Key Value Item), the two factors of average commodity-page dwell time and pull amount are also taken into account, which makes the KVI more accurate and provides a basis for better managing commodity pricing strategy.

The illustrative embodiments of the disclosure have been particularly shown and described above. It should be understood that the disclosure is not limited to the detailed structures, arrangements or implementation methods described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A data processing method, characterized by comprising:

querying data and generating a data exchange table summarized over a predetermined time period;

processing the data exchange table using the Spark computing framework, including:

reading the data exchange table and slicing the data by a data attribute;

distributing the sliced data to the nodes of a server cluster;

performing, at each node, the calculation related to the data attribute;

collecting the execution results of the nodes;

wherein each node is configured with an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework.

2. The processing method of claim 1, characterized in that querying the data includes: querying a data warehouse using the Hive query language HQL.

3. The processing method of claim 1, characterized in that reading the data exchange table includes: reading the data exchange table using Spark SQL.

4. The processing method of claim 1, characterized in that the sliced data is stored in the form of resilient distributed datasets.

5. The processing method of claim 1, characterized in that the Spark computing framework and the R or Python computing module exchange data through a pipe.

6. The processing method of claim 1, characterized in that the data attribute is the commodity category, the data exchange table includes a commodity table, an order table and a traffic table, and the calculation related to the data attribute performed by each node includes calculating a commodity pricing index, the commodity pricing index being calculated from the user visits, the average dwell time on the commodity page, the sales volume and the pull amount within a predetermined time period, each with its own weight.

7. The processing method of claim 6, characterized in that the predetermined time period is 7 days or 30 days.

8. The processing method of claim 6 or 7, characterized in that the commodities are sorted in descending order of the calculated commodity pricing index score, the top 20% of the commodities in the ranking are determined to be the most important commodities, those after the top 20% and up to 50% are determined to be key commodities, those after 50% and up to 80% are determined to be general commodities, and those after 80% are determined to be unimportant commodities.

9. A data processing device, characterized by comprising:

a summarizing module for querying data and generating a data exchange table summarized over a predetermined time period;

a processing module for processing the data exchange table using the Spark computing framework, including:

reading the data exchange table and slicing the data by a data attribute;

distributing the sliced data to the nodes of a server cluster;

performing, at each node, the calculation related to the data attribute;

collecting the execution results of the nodes;

an R or Python computing module for handling the parts that cannot be computed within the Spark computing framework.

10. The processing device of claim 9, characterized in that the data attribute is the commodity category, the data exchange table includes a commodity table, an order table and a traffic table, and the calculation related to the data attribute performed by each node includes calculating a commodity pricing index, the commodity pricing index being calculated from the user visits, the average dwell time on the commodity page, the sales volume and the pull amount within a predetermined time period, each with its own weight.

11. The processing device of claim 10, characterized in that the predetermined time period is 7 days or 30 days.

12. The processing device of claim 10 or 11, characterized in that the commodities are sorted in descending order of the calculated commodity pricing index score, the top 20% of the commodities in the ranking are determined to be the most important commodities, those after the top 20% and up to 50% are determined to be key commodities, those after 50% and up to 80% are determined to be general commodities, and those after 80% are determined to be unimportant commodities.