CN117609095A - Code large model-oriented evaluation set quality detection method and device - Google Patents
Code large model-oriented evaluation set quality detection method and device
- Publication number
- CN117609095A (application number CN202311756490.3A)
- Authority
- CN
- China
- Prior art keywords
- evaluation
- sample
- detection
- detection result
- code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3604—Analysis of software for verifying properties of programs
- G06F11/3612—Analysis of software for verifying properties of programs by runtime analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/427—Parsing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Stored Programmes (AREA)
Abstract
The embodiments of this specification provide a code large model-oriented evaluation set quality detection method and device. The method comprises the following steps: performing static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different types of code tasks; performing code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result; and determining the quality detection result of the evaluation set at least according to the first detection result and the second detection result.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning technology, and in particular to a method and apparatus for detecting the quality of an evaluation set for a large code model, a computer-readable storage medium, and a computing device.
Background
Large models mainly refer to deep learning systems of enormous scale and complexity, containing hundreds of billions of parameters or more, which computationally generate corresponding reply content from user input. A code large model is a large model whose prediction tasks involve code, where code refers to a high-level programming language used to write computer programs, such as Java, Python, or Golang; code-related task types include, but are not limited to, code generation, code line annotation, and code function interpretation.

Before a code large model is put into practical use, its performance generally needs to be evaluated, both to guide improvement and optimization of the model and to provide a decision basis for decision makers, thereby ensuring the model's reliability and effectiveness in practical applications and, in turn, the user experience. Performance evaluation requires an evaluation set, and the quality of the evaluation set is crucial to the reliability of the evaluation results: a low-quality evaluation set may lead to misleading evaluation results, undiscovered problems, and unreliable comparisons and rankings, with serious consequences such as decision errors, model product failures, and loss of customer trust.

Against this background, the embodiments of this specification provide a quality detection method for the evaluation set that can comprehensively and accurately detect its quality, thereby ensuring the reliability of the results of performance evaluation for a large code model.
Disclosure of Invention
The embodiments of this specification describe a code large model-oriented evaluation set quality detection method and device that enable comprehensive and accurate quality detection of the evaluation set.
According to a first aspect, there is provided a code large model-oriented evaluation set quality detection method, including:
performing static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types; performing code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result; and determining the quality detection result of the evaluation set at least according to the first detection result and the second detection result.
In one embodiment, the integrity detection comprises: for each evaluation sample, determining whether the evaluation sample contains each necessary field; and determining, based on the determination result, whether the evaluation sample is a complete sample.

In one embodiment, the repeatability detection comprises: for any evaluation sample in the evaluation set, parsing the code fragments extracted from it into a syntax tree; calculating the similarity between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples; and determining whether the evaluation sample is a duplicate sample based on the magnitude relation between the similarity and a first threshold.

In a specific embodiment, calculating the similarity between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples comprises: calculating, as the similarity, the edit distance between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples.

In one embodiment, the sample balance detection comprises: for any task type, determining the statistical number of evaluation samples of that task type in the evaluation set; and determining whether the samples in the evaluation set are balanced based on how the ratio of that number to the total number of samples in the evaluation set compares with a second threshold.

In one embodiment, performing code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result comprises: for each evaluation sample, when the corresponding task type is determined to be code generation, assembling the prompt word, standard answer, and test case in the evaluation sample into an executable code program and running it; and checking whether the code program compiles successfully and whether the test case passes, the check result being included in the second detection result.

In one embodiment, the method further comprises: performing sample-set overlap detection between the evaluation set and the training set of the code large model to obtain a third detection result. Determining the quality detection result of the evaluation set at least according to the first detection result and the second detection result then comprises: determining the quality detection result according to the first detection result, the second detection result, and the third detection result.

In a specific embodiment, the sample-set overlap detection comprises: determining the identical samples between the evaluation set and the training set; calculating the ratio of the number of identical samples to the total number of samples in the evaluation set as the sample-set overlap; and determining whether the evaluation set passes the sample-set overlap detection based on the magnitude relation between the sample-set overlap and a third threshold.

In a more specific embodiment, determining the identical samples between the evaluation set and the training set comprises: searching the training set with a retrieval algorithm, based on features extracted for each evaluation sample in the evaluation set, to obtain the training sample with the highest matching degree with the evaluation sample; and classifying the evaluation sample as an identical sample when the similarity between the evaluation sample and the retrieved training sample is greater than a fourth threshold.

Further, in one example, the extraction of the features comprises: when the task type corresponding to the evaluation sample is code generation, code interpretation, or code annotation, extracting features based on the natural-language text in the evaluation sample; or, when the task type corresponding to the evaluation sample is code translation between different programming languages or code completion, extracting the code fragments in the evaluation sample, parsing them into a syntax tree, and extracting features from the syntax tree.

In one embodiment, the method further comprises: correcting the evaluation set when its quality detection result indicates a failed detection.
According to a second aspect, there is provided an evaluation set quality detection apparatus for a large code model, including:
a static detection unit configured to perform static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types; a dynamic detection unit configured to perform code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result; and a detection result determining unit configured to determine the quality detection result of the evaluation set at least according to the first detection result and the second detection result.

In one embodiment, the apparatus further includes an overlap detection unit configured to perform sample-set overlap detection between the evaluation set and the training set of the code large model to obtain a third detection result. The detection result determining unit is then specifically configured to determine the quality detection result according to the first detection result, the second detection result, and the third detection result.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first aspect.
With the method and device provided by the embodiments of this specification, static detection, dynamic detection, and overlap detection are performed on the evaluation set, enabling comprehensive and accurate quality detection and ensuring the reliability of the results of performance evaluation for the code large model. Specifically: (1) in static detection, checking the integrity and repeatability of the data effectively screens problematic data out of the evaluation set, improving its accuracy and consistency; (2) in dynamic detection, running the code programs in a container image and checking whether they compile and whether the test cases pass verifies the code execution of the evaluation set, ensuring that the code in it executes correctly and is functionally complete; (3) in overlap detection, comparing the evaluation data against the model's training data with similarity calculation detects whether the two overlap, avoiding overfitted evaluation results and insufficient generalization capability.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow diagram of a code large model-oriented evaluation set quality detection method according to one embodiment;

FIG. 2 is a schematic flow diagram of an evaluation set quality detection method for a large code model according to another embodiment;

FIG. 3 is a block diagram of quality detection of a code large model evaluation set disclosed in an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the functional module structure of an evaluation set quality detection apparatus for a large code model disclosed in an embodiment of the present disclosure.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
As stated earlier, a low-quality evaluation set can affect the accuracy of the performance evaluation results of a code large model. To this end, the embodiments of this specification provide a quality detection method for the evaluation set that can perform comprehensive and accurate quality detection, thereby ensuring the reliability of the results of performance evaluation for the code large model.

FIG. 1 is a schematic flow diagram of a code large model-oriented evaluation set quality detection method according to one embodiment. It is understood that the method may be executed by any apparatus, server, or device cluster with computing and processing capabilities. As shown in FIG. 1, the method comprises the following steps:

Step S110: perform static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types. Step S120: perform code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result. Step S130: determine the quality detection result of the evaluation set according to the first detection result and the second detection result.

These steps are described in detail below:

In step S110, static detection is performed on the evaluation set to obtain a first detection result. Note that the term "first" in "first detection result", like "second", "third", and similar terms elsewhere in this text, merely distinguishes similar things and carries no ordering or other limitation.
In one embodiment, the static detection includes integrity detection of each evaluation sample. It will be appreciated that an evaluation sample needs to include an input portion fed to the code large model, such as a prompt word (prompt), and a label portion (also called the reference answer or standard solution) compared against the model's corresponding output. Thus, for each evaluation sample, it can be determined whether the sample contains each necessary field, such as the prompt word.

In addition, when the evaluation set involves different task types, the fields composing the input portion or label portion may differ across task types. In that case, the code task type of each evaluation sample may be determined first, and the sample then checked against a dictionary of necessary fields for that task type, covering fields such as test cases, a unique index, or the reference answer.

Further, based on the result of this determination, it can be decided whether the corresponding evaluation sample is a complete sample. Illustratively, if the evaluation sample contains all necessary fields, it is judged complete; otherwise it is judged incomplete.
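As an illustration, this field check can be sketched as follows. This is a minimal sketch: the task-type names and required-field names below are illustrative assumptions, not the actual schema of any particular evaluation set.

```python
# Minimal sketch of the integrity check. The task types and required
# field names are illustrative assumptions, not the actual schema.
REQUIRED_FIELDS = {
    "code_generation": ["prompt", "reference_answer", "test_cases"],
    "code_annotation": ["prompt", "reference_answer"],
    "code_completion": ["prompt", "reference_answer"],
}

def is_complete(sample: dict) -> bool:
    """A sample is complete if every necessary field for its task type
    is present and non-empty."""
    required = REQUIRED_FIELDS.get(sample.get("task_type"), [])
    return all(sample.get(field) for field in required)

def complete_ratio(evaluation_set: list[dict]) -> float:
    """Proportion of complete samples, usable in the first detection result."""
    if not evaluation_set:
        return 0.0
    return sum(is_complete(s) for s in evaluation_set) / len(evaluation_set)
```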
Thus, the first detection result of the static detection may include the result of the integrity detection, such as whether each evaluation sample is a complete sample, and may further include the proportion of complete samples in the evaluation set.

In another embodiment, the static detection includes sample repeatability detection of the evaluation set. Specifically, the following sub-steps are performed for any evaluation sample in the evaluation set:

1) Parse the code program related to the evaluation sample into a syntax tree.

Illustratively, code fragments may be extracted from the evaluation sample and then parsed into an abstract syntax tree (AST) using a parser tool such as ANTLR4. Further, if multiple code fragments are extracted, they may be parsed into separate syntax trees, or concatenated and then parsed.

2) Calculate the similarity between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples.

In one example, the cosine similarity between syntax trees, or the reciprocal of their Euclidean distance, may be calculated as this similarity. In another example, the edit distance between syntax trees may be calculated as the similarity, which improves the accuracy and usability of the similarity measure.

3) Determine whether the evaluation sample is a duplicate sample based on the magnitude relation between the similarity and a preset first threshold (e.g., 90%): if the similarity is greater, the sample is judged to be a duplicate; otherwise it is not.
To aid understanding, an example of code that parses a Java code fragment and makes a similarity determination is shown below:
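The original listing is not reproduced in this text; the following is a minimal, hypothetical Python sketch of the same idea, under two stated simplifications: Python's built-in ast module stands in for a Java parser such as ANTLR4, and a Levenshtein edit distance over the flattened node-type sequence stands in for a full tree edit distance.

```python
import ast

def node_sequence(code: str) -> list[str]:
    """Parse a code fragment and flatten its syntax tree into a
    sequence of node-type names."""
    return [type(n).__name__ for n in ast.walk(ast.parse(code))]

def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def similarity(code_a: str, code_b: str) -> float:
    """Map the edit distance into a similarity score in [0, 1]."""
    sa, sb = node_sequence(code_a), node_sequence(code_b)
    return 1 - edit_distance(sa, sb) / max(len(sa), len(sb), 1)

# Structurally identical fragments score 1.0 even when identifiers differ,
# so they would be flagged as duplicates under a 90% threshold.
print(similarity("def f(x):\n    return x + 1",
                 "def g(y):\n    return y + 2"))
```

Flattening the tree keeps the sketch short; a production implementation would more likely compute a true tree edit distance over the ASTs themselves.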
From the above, sample repeatability detection of the evaluation set can be achieved, so the first detection result of the static detection may include the result of the repeatability detection, such as the number of duplicated samples or the sample repetition rate. Illustratively, the sample repetition rate may be calculated as:

sample repetition rate = 1 - (number of non-duplicated samples / total number of samples in the evaluation set)   (1)

In yet another embodiment, the static detection includes sample balance detection of the evaluation set under different task types. In a specific embodiment, for any task type, the statistical number of evaluation samples of that task type in the evaluation set is determined; then, based on how the ratio of that number to the total number of samples in the evaluation set compares with a preset second threshold, it is determined whether the samples in the evaluation set are balanced. For example, if the evaluation set involves 5 code task types, the second threshold may be set to 25%: whenever the evaluation samples of some code task type exceed 25% of the set, the evaluation set is judged unbalanced, and otherwise balanced.

In another specific embodiment, for each task type, the proportion of evaluation samples of that task type in the evaluation set is determined; if the difference between the proportions of any two task types is greater than a predetermined threshold (e.g., 5%), the evaluation set is judged unbalanced, and otherwise balanced. In this way, sample balance detection can be achieved.
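A minimal sketch of the first ratio check, assuming each sample carries a hypothetical task_type field and using the 25% threshold from the five-type example above:

```python
from collections import Counter

def is_balanced(evaluation_set: list[dict],
                second_threshold: float = 0.25) -> bool:
    """Judge the set unbalanced if any task type's share of all samples
    exceeds the second threshold."""
    counts = Counter(s["task_type"] for s in evaluation_set)
    total = len(evaluation_set)
    return all(count / total <= second_threshold for count in counts.values())
```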
The static detection performed on the evaluation set in step S110 has been described above; the corresponding first detection result may include the integrity detection result, the repeatability detection result, and/or the balance detection result.

Step S120 may be performed before, after, or in parallel with step S110: code accuracy detection (or dynamic detection) is carried out by running the code program related to each evaluation sample in the evaluation set, obtaining a second detection result.
Specifically, for each evaluation sample in the evaluation set, the executable code program involved is determined first. It should be understood that the composition of evaluation samples may differ across task types, so a different way of determining the code program may be designed for each. In one embodiment, for an evaluation sample whose task type is code generation, the prompt word, standard answer, and test cases in the sample are assembled into an executable code program. The following illustrates the content of a code generation evaluation sample:
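The sample content is illustrated here with a hypothetical record; all field names and values below are assumptions rather than the actual listing.

```python
# Hypothetical code-generation evaluation sample (illustrative only).
sample = {
    "task_type": "code_generation",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "reference_answer": "    return a + b\n",
    "test_cases": "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n",
}
```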
Concatenating the prompt word, the standard solution, and the test-case content of this example in sequence yields an executable Python program.

In another embodiment, for an evaluation sample whose task type is code completion, the code in the standard solution may be patched into the corresponding position in the prompt word to obtain an executable code program.

Further, the executable code program obtained above may be run to obtain a running result. For example, a container image may be selected according to the programming language of the code program and the program run inside that image; the running result may include whether the code program compiles successfully and whether the test cases pass.
The second detection result described above can then be determined based on the running result. In one example, when the running result is that the test cases pass, "code accurate" may be taken as the dynamic detection result of the corresponding evaluation sample and included in the second detection result. In another example, the proportion of samples in the evaluation set whose code runs correctly may be counted; if this proportion is greater than a predetermined threshold, the second detection result is that the evaluation set passes the dynamic detection, and otherwise that it fails.
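A minimal sketch of the assemble-and-run step, under two assumptions: the field names follow the hypothetical sample above, and the program is run directly with the current interpreter rather than inside a per-language container image as described.

```python
import os
import subprocess
import sys
import tempfile

def run_sample(sample: dict, timeout_s: int = 10) -> bool:
    """Assemble prompt + reference answer + test cases into one program
    and report whether it compiles and all assertions pass (exit code 0)."""
    program = sample["prompt"] + sample["reference_answer"] + sample["test_cases"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a hanging program counts as a failed run
    finally:
        os.unlink(path)  # remove the temporary program file
```

The proportion of samples for which run_sample returns True can then be compared against the predetermined threshold to form the second detection result.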
In this way, dynamic detection of the evaluation samples can be realized, yielding the second detection result.
Thereafter, step S130 may be performed to determine a quality detection result of the evaluation set according to the first detection result and the second detection result.
In one embodiment, the quality detection result of the evaluation set is determined to be a pass if both the first detection result and the second detection result indicate a pass, and a fail otherwise.

In another embodiment, a comprehensive score may be computed from the first detection result and the second detection result and used as the quality detection result of the evaluation set.

In yet another embodiment, a quality detection report combining graphics and text may be generated from the first detection result and the second detection result as the quality detection result of the evaluation set.

In summary, with the code large model-oriented evaluation set quality detection method disclosed in the embodiments of this specification, performing static detection and dynamic detection on the evaluation set effectively improves the accuracy and usability of the quality detection results.

According to another embodiment, this specification further proposes that, besides static detection and dynamic detection, overlap detection may be performed on the evaluation set. This reflects the consideration that the evaluation set of a code large model should be independent of its training set, so that evaluation does not reuse samples seen during training, which would otherwise overfit the evaluation results.
Specifically, FIG. 2 is a schematic flow diagram of an evaluation set quality detection method for a large code model according to another embodiment; the method may be executed by any apparatus, server, or device cluster with computing and processing capabilities. As shown in FIG. 2, the method comprises the following steps:

Step S210: perform static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types. Step S220: perform code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result. Step S230: perform sample-set overlap detection between the evaluation set and the training set of the code large model to obtain a third detection result. Step S240: determine the quality detection result of the evaluation set according to the first detection result, the second detection result, and the third detection result.

These steps are described in detail below:

For steps S210 and S220, reference may be made to the foregoing descriptions of steps S110 and S120; details are not repeated here.

As for step S230, the embodiments of this specification do not limit its execution order relative to step S210 or step S220. The implementation of step S230 may include:
1) Determine the identical samples between the evaluation set and the training set of the code large model.

In one embodiment, for each evaluation sample in the evaluation set, the training set is first searched with a retrieval algorithm, based on features extracted from the sample, to obtain the training sample that best matches it.

It will be appreciated that, for evaluation samples of different task types, feature extraction may be performed on different portions of the sample.

In a specific embodiment, when the task type corresponding to the evaluation sample is code generation, features may be extracted from the prompt word. For example, the natural-language text in the prompt word may be segmented, and the text tokens or their embedding vectors used as the extracted features of the evaluation sample.

In another specific embodiment, when the task type corresponding to the evaluation sample is code annotation or code function interpretation, features may be extracted from the standard-answer portion of the sample, for example its code syntax elements and text tokens.

In yet another specific embodiment, when the task type corresponding to the evaluation sample is code translation between different programming languages or code completion, the code fragments in the sample are extracted and parsed into a syntax tree, and features are extracted from the syntax tree, for example syntax elements such as method names and parameter types.

After the features of the evaluation sample are extracted, a retrieval algorithm such as BM25 or TF-IDF can be used to search the training set based on the extracted features, obtaining the training sample with the highest matching degree.

Further, the similarity between the evaluation sample and the retrieved training sample is calculated. In a specific embodiment, the sample content involved in the similarity calculation is the same content from which the sample features were extracted. In a specific embodiment, the cosine similarity or the Euclidean distance between the samples may be calculated to obtain the similarity between them.

Then, if the calculated similarity is greater than a corresponding predetermined threshold (e.g., 0.9), the evaluation sample is classified as an identical sample.
Identical samples shared with the training set can thus be screened out of the evaluation set through retrieval and similarity calculation. In another embodiment, the retrieval step may be omitted and the similarity calculated and judged directly against all training samples to determine the identical samples, though the computation is clearly much larger than retrieving first.

2) Calculate the ratio of the number of identical samples to the total number of samples in the evaluation set as the sample-set overlap.

3) Determine whether the evaluation set passes the sample-set overlap detection based on the magnitude relation between the overlap and a corresponding threshold (e.g., 3%), and take this as the third detection result: if the overlap is greater than the threshold, the third detection result is that the overlap detection fails; otherwise, that it passes.
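A minimal sketch of the retrieval-plus-similarity screening, with TF-IDF standing in for the BM25 option; the inputs are assumed to be the extracted per-sample feature texts (natural-language prompts or serialized syntax trees) described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sample_set_overlap(eval_texts: list[str], train_texts: list[str],
                       fourth_threshold: float = 0.9) -> float:
    """For each evaluation sample, find its closest training sample under
    TF-IDF and count it as identical when the cosine similarity exceeds
    the fourth threshold; return the resulting sample-set overlap ratio."""
    if not eval_texts:
        return 0.0
    vectorizer = TfidfVectorizer()
    train_vecs = vectorizer.fit_transform(train_texts)
    eval_vecs = vectorizer.transform(eval_texts)
    sims = cosine_similarity(eval_vecs, train_vecs)  # shape (n_eval, n_train)
    identical = (sims.max(axis=1) > fourth_threshold).sum()
    return identical / len(eval_texts)

# The evaluation set passes the overlap detection when this ratio does
# not exceed the third threshold (e.g., 3%).
```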
From the above, performing static detection, dynamic detection, and overlap detection in steps S210, S220, and S230 yields the first, second, and third detection results, respectively. Step S240 may then be performed to determine the quality detection result of the evaluation set based on these three detection results.

In one embodiment, the quality detection result of the evaluation set is determined to be a pass if the first, second, and third detection results all indicate a pass, and a fail otherwise.

In another embodiment, a comprehensive score may be computed from the three detection results and used as the quality detection result of the evaluation set.

In yet another embodiment, a quality detection report combining graphics and text may be generated from the three detection results as the quality detection result of the evaluation set.

With the evaluation set quality detection method for code large models disclosed in the embodiments of this specification, static detection, dynamic detection, and overlap detection together enable comprehensive and accurate quality detection of the evaluation set, ensuring the reliability of the results of performance evaluation for the code large model.
According to an embodiment of a further aspect, during or after performing the method shown in FIG. 1 or FIG. 2, the evaluation set may be corrected according to the completed detection results until the corrected final evaluation set meets the usage standard. For example, in the integrity detection link of static detection, detected incomplete samples are removed from the evaluation set, or completed and retained; in the repeatability detection link, for each group of mutually duplicate samples, one sample is kept in the evaluation set and the others removed; in the sample balance detection link, the proportions of the various task-type samples are adjusted when imbalance is detected. For another example, in dynamic detection, samples that fail the code accuracy check are removed from the evaluation set. For another example, in overlap detection, samples in the evaluation set that also belong to the training set are removed.
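As an illustration, the removal-type corrections can be sketched as a single filtering pass; the predicates reuse the hypothetical helpers from the earlier sketches, and the duplicate-group and overlap inputs are assumptions.

```python
def correct_evaluation_set(samples: list[dict],
                           duplicate_groups: dict[int, int],
                           overlapping_ids: set[int]) -> list[dict]:
    """Keep complete, runnable, non-overlapping samples, and one sample
    per group of mutual duplicates. duplicate_groups maps a sample id to
    its duplicate-group id; overlapping_ids holds ids of samples also
    found in the training set. is_complete and run_sample are the
    hypothetical helpers sketched earlier."""
    kept, seen_groups = [], set()
    for s in samples:
        if not is_complete(s) or not run_sample(s):
            continue                      # integrity / dynamic checks
        if s["id"] in overlapping_ids:
            continue                      # overlap check
        group = duplicate_groups.get(s["id"])
        if group is not None:
            if group in seen_groups:
                continue                  # repeatability: keep one per group
            seen_groups.add(group)
        kept.append(s)
    return kept
```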
To facilitate understanding, the execution flow of the above evaluation set quality detection method is further illustrated below in conjunction with the flow chart of FIG. 3. As shown in FIG. 3, the quality detection process includes the following stages:

1) Static verification stage

The static checks include the integrity check, the repeatability check, and the sample balance check.

1.1) Integrity check

Each evaluation sample in the evaluation set is traversed using the necessary-field dictionary, and it is judged whether any necessary field in the sample is empty: if none is empty, the sample's data is judged complete; if a necessary field is empty, the sample is judged to be missing information.

1.2) Repeatability check

For each evaluation sample, a syntax parser (for example, tree-sitter) is used to parse the code fragments in the sample into a syntax tree, and the similarity to the code fragments of other evaluation samples is calculated; if the calculated similarity is greater than a preset threshold (e.g., 0.8), the evaluation sample is judged to duplicate those other samples, and otherwise not.

1.3) Sample balance check

The evaluation set is classified by task type and the proportion of samples in each class is determined; if the proportion of some class is higher than a preset threshold, the class distribution of the samples is judged unbalanced, and otherwise balanced.
2) Dynamic verification
An executable code program is assembled from the content of the evaluation sample and run inside a container image matching its programming language, and the running result is checked, for example whether the code program compiles successfully and whether the test cases pass.
3) Overlap detection
Each evaluation sample in the evaluation set is traversed, and its prompt word and answer are combined. The combined code is parsed to obtain the corresponding syntax tree, and features are extracted from the syntax tree. Then, based on the extracted features, a BM25 or TF-IDF retrieval algorithm is used to search the training set for the training sample with the highest matching degree; the similarity between the evaluation sample and the retrieved training sample is calculated and compared against a preset threshold (e.g., 0.9) to obtain the overlap detection result.

4) If the detection results of the static check, the dynamic check, and the overlap check all indicate a pass, the final check result of the evaluation set is judged a pass; otherwise it is judged a fail.

In summary, with the evaluation set quality detection scheme disclosed in the embodiments of this specification: (1) in static detection, checking the integrity and repeatability of the data effectively screens problematic data out of the evaluation set, improving its accuracy and consistency; (2) in dynamic detection, running the code programs in a container image and checking whether they compile and whether the test cases pass verifies the code execution of the evaluation set, ensuring that the code in it executes correctly and is functionally complete; (3) in overlap detection, comparing the evaluation data against the model's training data with similarity calculation detects whether the two overlap, avoiding overfitted evaluation results and insufficient generalization capability.
Corresponding to the above evaluation set quality detection method, the embodiments of this specification also disclose an evaluation set quality detection apparatus. FIG. 4 is a schematic diagram of the functional module structure of an evaluation set quality detection apparatus for a large code model disclosed in an embodiment of this specification. As shown in FIG. 4, the apparatus 400 includes:

a static detection unit 410 configured to perform static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types; a dynamic detection unit 420 configured to perform code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result; and a detection result determining unit 430 configured to determine a quality detection result of the evaluation set at least according to the first detection result and the second detection result.
In one embodiment, the static detection unit 410 includes an integrity detection subunit 411 configured to perform the integrity detection: for each evaluation sample, determining whether the evaluation sample contains each necessary field, and determining, based on the determination result, whether the evaluation sample is a complete sample.

In one embodiment, the static detection unit 410 includes a repeatability detection subunit 412 configured to perform the repeatability detection: for any evaluation sample in the evaluation set, parsing the code fragments extracted from it into a syntax tree; calculating the similarity between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples; and determining whether the evaluation sample is a duplicate sample based on the magnitude relation between the similarity and the first threshold.

In a specific embodiment, the repeatability detection subunit 412 is specifically configured to calculate, as the similarity, the edit distance between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to other evaluation samples.

In one embodiment, the static detection unit 410 includes a balance detection subunit 413 configured to perform the sample balance detection: for any task type, determining the statistical number of evaluation samples of that task type in the evaluation set; and determining whether the samples in the evaluation set are balanced based on how the ratio of that number to the total number of samples in the evaluation set compares with a second threshold.

In one embodiment, the dynamic detection unit 420 is specifically configured to: for each evaluation sample, when the corresponding task type is determined to be code generation, assemble the prompt word, standard answer, and test case in the evaluation sample into an executable code program and run it; and check whether the code program compiles successfully and whether the test case passes, the check result being included in the second detection result.
In one embodiment, the apparatus 400 further includes an overlap detection unit 440 configured to perform sample-set overlap detection between the evaluation set and the training set of the code large model to obtain a third detection result. The detection result determining unit 430 is then specifically configured to determine the quality detection result according to the first detection result, the second detection result, and the third detection result.

In a specific embodiment, the overlap detection unit 440 is specifically configured to: determine the identical samples between the evaluation set and the training set; calculate the ratio of the number of identical samples to the total number of samples in the evaluation set as the sample-set overlap; and determine whether the evaluation set passes the sample-set overlap detection based on the magnitude relation between the sample-set overlap and a third threshold.

Further, in one example, the overlap detection unit 440 is further configured to: search the training set with a retrieval algorithm, based on the features extracted for each evaluation sample in the evaluation set, to obtain the training sample with the highest matching degree with the evaluation sample; and classify the evaluation sample as an identical sample when the similarity between the evaluation sample and the retrieved training sample is greater than a fourth threshold.

Still further, the overlap detection unit 440 is configured to: when the task type corresponding to the evaluation sample is code generation, code interpretation, or code annotation, extract features based on the natural-language text in the evaluation sample; or, when the task type corresponding to the evaluation sample is code translation between different programming languages or code completion, extract the code fragments in the evaluation sample, parse them into a syntax tree, and extract features from the syntax tree.

In one embodiment, the apparatus 400 further includes a correction unit 450 configured to correct the evaluation set when its quality detection result indicates a failed detection.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1 or fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory storing executable code, and the processor implementing the method described in connection with FIG. 1 or FIG. 2 when executing the executable code. Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code.

The foregoing embodiments illustrate the general principles of the present invention in further detail and are not to be construed as limiting its scope; any modifications, equivalents, improvements, and the like based on the teachings of the invention are intended to fall within its scope of protection.
Claims (15)
1. A code large model-oriented evaluation set quality detection method comprises the following steps:
performing static detection on the evaluation set to obtain a first detection result, wherein the static detection comprises at least one of the following steps: carrying out integrity detection on each evaluation sample, carrying out sample repeatability detection on an evaluation set, and carrying out sample balance detection on the evaluation set under different task types;
performing code accuracy detection by running code programs related to each evaluation sample in the evaluation set to obtain a second detection result;
and determining the quality detection result of the evaluation set at least according to the first detection result and the second detection result.
2. The method of claim 1, wherein the integrity detection comprises:
for each evaluation sample, determining whether the evaluation sample contains each necessary field;
and determining, based on the determination result, whether the evaluation sample is a complete sample.
3. The method of claim 1, wherein the repeatability detection comprises:
for any evaluation sample in the evaluation set, parsing the code fragments extracted from the evaluation sample into a syntax tree;
calculating the similarity between the syntax tree corresponding to the evaluation sample and syntax trees corresponding to other evaluation samples;
and determining whether the evaluation sample is a duplicate sample based on a magnitude relation between the similarity and a first threshold.
4. The method of claim 3, wherein calculating the similarity between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to the other evaluation samples comprises:
calculating, as the similarity, the edit distance between the syntax tree corresponding to the evaluation sample and the syntax trees corresponding to the other evaluation samples.
5. The method of claim 1, wherein the sample balance detection comprises:
for any task type, determining the statistical number of evaluation samples of the task type in the evaluation set;
and determining whether the samples in the evaluation set are balanced based on a magnitude relation between a second threshold and the ratio of the statistical number to the total number of samples in the evaluation set.
6. The method of claim 1, wherein performing code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result comprises:
for each evaluation sample, in a case where the corresponding task type is determined to be code generation, assembling the prompt word, standard answer, and test case in the evaluation sample into an executable code program, and running the code program;
and checking whether the code program compiles successfully and whether the test case passes, and including the check result in the second detection result.
7. The method of claim 1, further comprising:
performing sample-set overlap detection between the evaluation set and a training set of the code large model to obtain a third detection result;
wherein determining the quality detection result of the evaluation set at least according to the first detection result and the second detection result comprises:
determining the quality detection result according to the first detection result, the second detection result, and the third detection result.
8. The method of claim 7, wherein the sample-set overlap detection comprises:
determining identical samples between the evaluation set and the training set;
calculating the ratio of the number of identical samples to the total number of samples in the evaluation set as the sample-set overlap;
and determining whether the evaluation set passes the sample-set overlap detection based on a magnitude relation between the sample-set overlap and a third threshold.
9. The method of claim 8, wherein determining identical samples between the evaluation set and the training set comprises:
searching the training set with a retrieval algorithm, based on features extracted for each evaluation sample in the evaluation set, to obtain the training sample with the highest matching degree with the evaluation sample;
and classifying the evaluation sample as an identical sample in a case where the similarity between the evaluation sample and the retrieved training sample is greater than a fourth threshold.
10. The method of claim 9, wherein the extraction of the features comprises:
in a case where the task type corresponding to the evaluation sample is code generation, code interpretation, or code annotation, extracting features based on the natural language text in the evaluation sample; or,
in a case where the task type corresponding to the evaluation sample is code translation between different programming languages or code completion, extracting the code fragments in the evaluation sample, parsing them into a syntax tree, and extracting features from the syntax tree.
11. The method of claim 1, further comprising:
correcting the evaluation set in a case where the quality detection result of the evaluation set indicates a failed detection.
12. A code large model-oriented evaluation set quality detection device, comprising:
a static detection unit configured to perform static detection on the evaluation set to obtain a first detection result, the static detection comprising at least one of: integrity detection of each evaluation sample, sample repeatability detection of the evaluation set, and sample balance detection of the evaluation set under different task types;
a dynamic detection unit configured to perform code accuracy detection by running the code program related to each evaluation sample in the evaluation set to obtain a second detection result;
and a detection result determining unit configured to determine the quality detection result of the evaluation set at least according to the first detection result and the second detection result.
13. The apparatus of claim 12, further comprising:
an overlap detection unit configured to perform sample-set overlap detection between the evaluation set and a training set of the code large model to obtain a third detection result;
wherein the detection result determining unit is specifically configured to:
determine the quality detection result according to the first detection result, the second detection result, and the third detection result.
14. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-11.
15. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-11.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311756490.3A CN117609095A (en) | 2023-12-19 | 2023-12-19 | Code large model-oriented evaluation set quality detection method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311756490.3A CN117609095A (en) | 2023-12-19 | 2023-12-19 | Code large model-oriented evaluation set quality detection method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117609095A true CN117609095A (en) | 2024-02-27 |
Family
ID=89946359
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311756490.3A Pending CN117609095A (en) | 2023-12-19 | 2023-12-19 | Code large model-oriented evaluation set quality detection method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117609095A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119397546A (en) * | 2024-10-14 | 2025-02-07 | 海南大学 | Smart contract library misuse detection method and system based on large language model and static analysis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |