Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the embodiments of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
The following disclosure provides many different implementations, or examples, for implementing different configurations of embodiments of the invention. In order to simplify the disclosure of embodiments of the present invention, components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit embodiments of the present invention. Furthermore, embodiments of the present invention may repeat reference numerals and/or letters in the various examples, which are for the purpose of brevity and clarity, and which do not themselves indicate the relationship between the various embodiments and/or arrangements discussed.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the present embodiment provides a vector database lifecycle management method based on SQL middleware, including:
S1, receiving and parsing an extended SQL statement containing a vector lifecycle policy declaration to obtain policy parameters, wherein the policy parameters comprise a target table, a target vector column, a scheduling rule, a drift threshold and maintenance action parameters;
S2, baseline snapshot generation: generating a first statistical snapshot from the target vector column according to an adaptive sampling rate, associating the first statistical snapshot with the policy parameters and storing both in a policy catalog;
S3, policy orchestration: selecting, from a preset API mapping relation table and according to the type of the target vector database, the call sequence corresponding to the maintenance action parameters, and generating a maintenance execution plan;
S4, periodic monitoring: sampling the target vector column according to the scheduling rule, and calculating a drift metric between the current sample and the first statistical snapshot;
S5, threshold judgment: triggering a maintenance flow when the drift metric is larger than the drift threshold or an index health indicator exceeds a preset performance threshold;
S6, zero-downtime maintenance via shadow index: after the maintenance flow is triggered, creating a shadow index in the target vector database based on the index parameters in the maintenance execution plan, verifying through read-traffic mirroring that the recall rate of the shadow index is not lower than a preset recall threshold, atomically switching the shadow index to be the primary index after verification passes, and deleting the old index after its query traffic drops to zero or a timeout elapses;
S7, result persistence: writing the threshold judgment result, the shadow index switching state and relevant performance indicators into a policy execution log for audit and tracing.
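For orientation, the periodic monitoring and maintenance part of the method (S4-S7) can be sketched as the following loop; this is a minimal illustration only, in which the catalog, collector and shadow-index helpers (and the latency-threshold field) are hypothetical placeholders for the components detailed in the sections below:
def monitoring_cycle(policy, catalog, collector, shadow_index_module):
    # S4: sample the target vector column and compute the drift metric against the baseline snapshot
    sample = collector.sample(policy["table"], policy["action"]["column"])
    drift = collector.drift_metric(sample, catalog.baseline_snapshot(policy))
    health = collector.index_health(policy["table"], policy["action"]["column"])
    # S5: threshold judgment
    triggered = (drift > policy["drift_detection"]["threshold"]
                 or health["p95_latency_ms"] > policy.get("latency_threshold_ms", float("inf")))
    switch_state = None
    if triggered:
        # S6: zero-downtime maintenance via a shadow index
        switch_state = shadow_index_module.rebuild_and_switch(policy)
    # S7: persist the results to the policy execution log for audit and tracing
    catalog.log_run(policy, drift=drift, health=health,
                    triggered=triggered, switch_state=switch_state)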
In some embodiments, the adaptive sampling rate defaults to a sampling probability of 0.01 and is dynamically adjusted according to data size: for large tables exceeding 100 million vectors the sampling rate is automatically reduced to 0.001, and conversely the sampling rate is increased for smaller tables to ensure statistical accuracy.
In some embodiments, the drift metric is calculated as the cosine divergence of the mean vectors of the baseline data and the current data, the KL divergence after bucketing the data, or the Wasserstein distance obtained via random projection.
In some embodiments, the index health indicator includes a query delay and a fragmentation rate, and the maintenance procedure is triggered when the query delay increases significantly or the fragmentation rate exceeds a preset threshold.
In some embodiments, during shadow index creation and switching, the shadow index is built in parallel in the background, production queries are not affected, and the switching time is controlled within 50 ms.
In some embodiments, the maintenance actions include re-indexing, re-embedding, and layering/archiving, where the re-embedding operation updates the vectors of severely drifted data by invoking an external AI service.
In some embodiments, the policy catalog includes an nd_policies table for recording policy definitions, an nd_policies_history table for recording policy version changes, and an nd_policy_run table for recording policy execution details.
In some embodiments, the API mapping relation table records the API call sequences for maintenance actions across different vector database versions, supporting Milvus, Pinecone and Weaviate databases.
In some embodiments, the layering/archiving determines cold and hot data based on time thresholds, access frequency, or custom SQL conditions for the data, migrates the cold data to a low cost store, and maintains a query routing table to support cold data queries.
The vector database lifecycle management system based on SQL middleware, which applies any of the above vector database lifecycle management methods based on SQL middleware, comprises:
the SQL parser is used for parsing the extended SQL statement and outputting the policy parameters;
the policy catalog is used for storing the policy parameters, the baseline snapshot and the execution log;
the statistics collector is used for monitoring vector distribution and index health state according to the adaptive sampling rate and calculating the cosine divergence, KL divergence or Wasserstein distance;
the API mapping relation table records the API call sequences for maintenance actions across different vector database versions;
the shadow index module is configured to create a shadow index, verify its recall rate and perform atomic switching without interrupting online queries;
the scheduler is used for triggering the execution of the statistics collector and the shadow index module according to the scheduling rules in the policy catalog;
the interaction results of the statistics collector and the shadow index module are written into the policy catalog by the scheduler to realize closed-loop lifecycle management.
The technical conception of the invention is as follows:
The invention extends standard SQL through a middleware platform (NineData), adding Data Definition Language (DDL) constructs that declaratively specify vector lifecycle policies. Policies include, but are not limited to, drift thresholds, index health thresholds, scheduling instructions, and layering/archiving rules.
NineData parses the SQL statement, verifies its semantics, stores the policy in a catalog, periodically executes monitoring queries on one or more vector databases, determines whether a predefined threshold is violated, and if so, automatically invokes maintenance operations through native database Application Programming Interfaces (APIs) or an external embedding service. All monitoring statistics and operations are persisted to audit tables.
The invention runs entirely in user space, requires no modification of the vector database kernel, provides cross-database compatibility, and supports Milvus, Pinecone and other vector databases. The invention realizes quality, performance and cost management of vector data through a unified SQL control plane.
1. System overview (FIG. 2)
The client issues SQL statements through a GUI, CLI or API, including:
lifecycle management DDL (e.g., CREATE VECTOR LIFECYCLE POLICY);
standard DML (e.g., INSERT), DQL (e.g., SELECT);
vector specific DDL (e.g., creating a vector table), DQL (e.g., similarity search).
The middleware layer includes:
the parser and planner, which parse SQL, verify semantics and generate an execution plan;
a policy catalog for storing policy definitions, snapshots and history records;
a scheduler for managing the timed execution of policies;
a statistics collector for monitoring vector data and index health;
and an operation engine for mapping policies to database APIs.
Communication: the middleware communicates with vector databases (e.g., Milvus, Pinecone) and relational databases through their SQL dialects or REST/gRPC APIs, and accesses external AI services (e.g., embedding generation).
Use examples:
SQL 1 -- create a vector table and build an HNSW index
CREATE TABLE image_embeddings(
image_id BIGINT PRIMARY KEY,
feature_vector VECTOR(512),
category VARCHAR(100),
-- build an approximate nearest-neighbor index; index parameters are given in PROPS
INDEX ann_idx(feature_vector)
PROPS(
index_type='HNSW',      -- index type: Hierarchical NSW
M=32,                   -- maximum number of neighbors per node
efConstruction=200,     -- size of the candidate list at build time
metric_type='IP'        -- similarity metric: Inner Product
)
);
SQL 2 -- vector similarity search within a specified category
SELECT image_id, category
FROM image_embeddings
WHERE category='landscape'    -- only landscape pictures are retrieved
ORDER BY
vector_search(
feature_vector,
:query_vector_placeholder
)
LIMIT 10;
SQL 3 -- declare a lifecycle policy that rebuilds the index monthly
CREATE VECTOR_LIFECYCLE_POLICY monthly_reindex
FOR TABLE image_embeddings          -- target table
ON COLUMN feature_vector            -- target vector column
SCHEDULE CRON '0 0 1 * *'           -- execute at 00:00 on day 1 of every month
DETECT DRIFT
USING COSINE_DIVERGENCE             -- drift metric: cosine divergence
DRIFT_THRESHOLD 0.1                 -- trigger threshold
ACTION REINDEX_VECTOR_COLUMN
PROPS(                              -- index parameters used for the rebuild
M=32,                               -- number of connections, consistent with SQL 1
efConstruction=256                  -- build depth, slightly higher than SQL 1
);
2. Parser and planner (Parser & Planner)
The parser and the planner are core components of the middleware, responsible for processing the SQL statements input by the user and ensuring their correctness and executability. The present invention is explained in detail below taking SQL 3 as an example; the specific flow refers to the SQL parsing and policy transformation flow shown in FIG. 3.
2.1 Recognition and conversion of SQL extensions
The parser uses syntax trees generated based on ANTLR (or other equivalent syntax parsing tool) to distinguish between standard SQL (e.g., SELECT, INSERT) and extended DDL constructs defined by the present invention (e.g., CREATE VECTOR LIFECYCLE POLICY).
The parser first scans the tokens in the SQL statement to make a recognition decision. When a particular keyword, such as VECTOR LIFECYCLE POLICY, is detected, the system routes the statement to a dedicated extended-DDL processing path. If no such special keyword is detected, the standard SQL statement is passed directly to the underlying database system (e.g., Milvus's SQL dialect or MySQL) for processing.
During parsing, the parser recognizes the CREATE VECTOR LIFECYCLE POLICY statement and parses its clauses (including FOR TABLE, ON COLUMN, SCHEDULE, DETECT DRIFT, ACTION, etc.) into components of the policy.
These extended DDL statements are then converted into internal policy objects, extracting key fields such as the policy name policy_name (e.g., monthly_reindex), the target table table_name (e.g., image_embeddings), the schedule in the form of a CRON expression (e.g., '0 0 1 * *'), the drift threshold drift_threshold (e.g., 0.1), the metric type drift_metric (e.g., COSINE_DIVERGENCE), and the action to be performed (e.g., REINDEX_VECTOR_COLUMN). Ultimately, these policy objects are passed to the planner in a structured format (typically JSON) for subsequent processing.
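As a minimal illustration of this routing and conversion step (the simplified keyword check and the clauses dictionary are assumptions of this sketch; a real implementation would walk the ANTLR parse tree), the logic can be outlined as follows:
import json

def route_statement(sql, passthrough, handle_policy_ddl):
    # extended DDL is detected by its dedicated keyword; everything else
    # is forwarded unchanged to the underlying database
    if "VECTOR_LIFECYCLE_POLICY" in sql.upper():
        return handle_policy_ddl(sql)
    return passthrough(sql)

def to_policy_object(clauses):
    # clauses would come from the parse tree; shown here as a plain dict
    policy = {
        "policy_name": clauses["policy_name"],      # e.g. monthly_reindex
        "table": clauses["table"],                  # FOR TABLE
        "schedule": clauses["schedule"],            # SCHEDULE CRON '0 0 1 * *'
        "drift_detection": {
            "metric": clauses["metric"],            # USING COSINE_DIVERGENCE
            "threshold": clauses["threshold"],      # DRIFT_THRESHOLD 0.1
            "enabled": True,
        },
        "action": {
            "type": clauses["action"],              # e.g. REINDEX_VECTOR_COLUMN
            "column": clauses["column"],            # ON COLUMN
            "params": clauses.get("props", {}),     # PROPS(...)
        },
    }
    return json.dumps(policy)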
2.2 Semantic verification
Before executing a policy, the planner performs comprehensive semantic verification to ensure that the policy can be correctly applied to the target database. Taking the monthly_reindex policy in SQL 3 as an example, the planner performs a multi-level verification process.
First, in the table verification stage, for the FOR TABLE image_embeddings clause the planner queries the metadata of the vector database, confirming through Milvus's describe_collection API that the image_embeddings table (collection) exists and contains at least one vector column; if the table (collection) does not exist or does not meet the requirements, the system immediately raises an error.
Second, in the metric verification step, for DRIFT_THRESHOLD 0.1 USING COSINE_DIVERGENCE the planner checks whether COSINE_DIVERGENCE belongs to the drift metric types supported by the system (the system also supports other options such as KL_DIVERGENCE and WASSERSTEIN_DISTANCE); if the specified metric type is not supported, the system suggests the available metric types to the user.
Third, during column verification, for the ON COLUMN feature_vector clause the planner verifies that feature_vector is a valid vector column of the image_embeddings table (collection) and confirms that the column supports the re-indexing operation specified by ACTION REINDEX_VECTOR_COLUMN, e.g., checking whether the column data type is a VECTOR type and whether the current index state allows rebuilding.
Fourth, during the action verification phase, the planner verifies whether the ACTION REINDEX_VECTOR_COLUMN action is applicable to the target database and the specified vector column, confirming that the action is available in the current database version.
Finally, during the parameter validation phase, the planner checks whether the parameters in PROPS (M=32, efConstruction=256) meet the index requirements of the target database, e.g., confirming that Milvus's HNSW index does support the M and efConstruction parameters, while other databases such as Pinecone may require a different parameter set; the validity of these parameters is confirmed by querying database metadata or using pre-configured adapters.
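A condensed sketch of these verification steps is given below; the adapter methods (get_collection_schema, supported_actions, supported_index_params) are hypothetical middleware helpers wrapping the underlying database metadata APIs, not actual Milvus or Pinecone calls:
SUPPORTED_METRICS = {"COSINE_DIVERGENCE", "KL_DIVERGENCE", "WASSERSTEIN_DISTANCE"}

def validate_policy(policy, adapter):
    # table verification: the target collection must exist and expose vector columns
    schema = adapter.get_collection_schema(policy["table"])
    if schema is None:
        raise ValueError(f"table {policy['table']} does not exist")
    # metric verification
    metric = policy["drift_detection"]["metric"]
    if metric not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported drift metric {metric}; supported: {sorted(SUPPORTED_METRICS)}")
    # column verification: the target column must be a vector column that allows rebuilding
    column = policy["action"]["column"]
    if column not in schema.vector_columns:
        raise ValueError(f"{column} is not a vector column of {policy['table']}")
    # action and parameter verification against the adapter's capability description
    action = policy["action"]["type"]
    if action not in adapter.supported_actions():
        raise ValueError(f"action {action} is not available for this database version")
    unknown = set(policy["action"]["params"]) - adapter.supported_index_params("HNSW")
    if unknown:
        raise ValueError(f"unsupported index parameters: {unknown}")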
2.3 Pre-planning/parameterization of operations
During processing, the planner is responsible for converting the user-defined declarative actions into a sequence of actually executable operations. This process first involves parameter mapping, i.e., converting the parameters specified by the user in SQL (e.g., PROPS (M=32, efConstruction=256)) into the format required by the native API of the target vector database. Taking Milvus as an example, whose create_index API requires parameters in JSON format, the planner automatically infers and generates a structured configuration such as {"index_type":"HNSW","params":{"M":32,"efConstruction":256}}.
In the execution phase, the planner makes decisions based on the current database state. For example, when it detects that a same-type index already exists in the Milvus collection and only parameters need to be adjusted (e.g., modifying the M value), the planner may choose an index-modification API for in-place parameter optimization instead of recreating the index.
To support multiple vector databases, the planner maintains a complete database adaptation layer that stores the API signatures and specific parameter requirements of various popular vector databases (including but not limited to Milvus and Pinecone). This design ensures that a user-defined vector lifecycle policy remains compatible across different database environments, enabling true cross-database operation. The multi-database adaptation layer architecture is shown in FIG. 10.
Summarizing, for SQL3:
1. The parser generates a syntax tree.
2. The planner verifies that image_embeddings and feature_vector exist and validates COSINE_DIVERGENCE and the HNSW parameters.
3. Generating a structured policy object stored as JSON:
{
"policy_name":"monthly_reindex",
"table":"image_embeddings",
"schedule":"0 0 1 * *",
"drift_detection":{
"metric":"COSINE_DIVERGENCE",
"threshold":0.1,
"enabled":true
},
"action":{
"type":"reindex",
"column":"feature_vector",
"params":{"M":32,"efConstruction":256}}}。
3. Strategy Catalog (Policy Catalog 124)
The policy directory provides reliable persistence support, ensuring that all policy definitions and execution histories can be stored securely and are not lost due to system restart or failure. Meanwhile, the strategy catalog also has perfect auditing capability, and through recording complete operation records, compliance requirements of enterprises and supervision are met, so that an administrator can track all strategy changes and execution conditions.
The policy directory is also responsible for maintaining lifecycle state and version information for each policy, enabling the system to accurately track the current state, historical changes, and execution progress of the policies. In addition, the method establishes a close association relationship among the strategy, the snapshot and the execution record to form a complete data chain, thereby facilitating analysis and fault investigation.
In a particular implementation, NineData stores the policy catalog in a separate metadata database (Meta-Store, e.g., SQLite) rather than in the managed vector databases. This ensures consistency across multiple vector databases and meets the requirements of transactions, auditing, backup and high availability.
Through this design, a user can, through SQL policy statements alone, have policies automatically persisted, scheduled for execution as planned, and fully history-tracked. The policy catalog essentially constitutes the infrastructure for "declarative vector data lifecycle management", enabling the entire system to operate in a reliable, traceable and efficient manner.
3.1 Catalog structural design
The policy directory contains the following four core tables:
nd_policies (policy definition table) stores active vector lifecycle policies; fields include:
- policy_id: varchar(50)
- policy_name: varchar(100) -- policy name
- table_name: varchar(100)
- column_name: varchar(100) -- target column name
- schedule: varchar(50)
- drift_metric: enum('COSINE_DIVERGENCE','KL_DIVERGENCE','WASSERSTEIN_DISTANCE')
- threshold: decimal(10,6)
- action_blob: json
- status: enum('ACTIVE','PAUSED','DISABLED')
- created_time: timestamp
- updated_time: timestamp
nd_vector_snapshots (baseline snapshot table) stores statistics of the vector data for drift detection comparison; fields include:
- snapshot_id: varchar(50)
- policy_id: varchar(50) -- associated policy
- table_name: varchar(100)
- column_name: varchar(100)
- creation_time: timestamp
- vector_count: bigint -- number of vectors
- stats: json -- statistical features
- version: int -- snapshot version
nd_policies_history (policy history table) records the version change history of policies, supporting audit and rollback.
nd_policy_run (execution log table) records the execution details of policies, including drift detection results, executed actions and performance metrics.
Further, the relationships between the tables are as follows:
nd_policies and nd_vector_snapshots (one-to-many: one policy may have multiple snapshot versions);
nd_policies and nd_policy_run (one-to-many: one policy has multiple execution records);
nd_vector_snapshots and nd_policy_run (one-to-many: a snapshot may be referenced by multiple execution records).
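As a minimal sketch of how the core nd_policies table could be created in a SQLite Meta-Store (the SQLite type mapping, with enums expressed as TEXT plus CHECK constraints, and the file name nd_meta_store.db are assumptions of this sketch):
import sqlite3

ND_POLICIES_DDL = """
CREATE TABLE IF NOT EXISTS nd_policies (
    policy_id    VARCHAR(50) PRIMARY KEY,
    policy_name  VARCHAR(100),
    table_name   VARCHAR(100),
    column_name  VARCHAR(100),
    schedule     VARCHAR(50),
    drift_metric TEXT CHECK (drift_metric IN
        ('COSINE_DIVERGENCE','KL_DIVERGENCE','WASSERSTEIN_DISTANCE')),
    threshold    DECIMAL(10,6),
    action_blob  JSON,
    status       TEXT CHECK (status IN ('ACTIVE','PAUSED','DISABLED')),
    created_time TIMESTAMP,
    updated_time TIMESTAMP
);
"""

with sqlite3.connect("nd_meta_store.db") as conn:
    conn.execute(ND_POLICIES_DDL)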
When a policy is created, the system automatically generates a baseline snapshot that serves as the reference for subsequent drift detection, so that the system can accurately identify changes in data distribution. To optimize storage space, the system only holds statistical features rather than complete vector data, which significantly reduces storage overhead and allows efficient operation even on large vector datasets. In addition, the system supports associating a policy with baseline snapshots of different versions; this multi-version support effectively copes with version differences, making management more flexible and suitable for different business scenarios.
4. Statistics collector (STATISTICS COLLECTOR)
The statistics collector is responsible for monitoring the vector data and index states, generating statistics for threshold comparison.
4.1 Vector sample query
The statistics collector adopts a sampling mechanism to monitor data distribution changes efficiently and accurately. The system implements dynamic sampling rate adjustment, adopting a default sampling probability of 0.01 that is automatically optimized according to data scale: for a large table with more than 100 million vectors, the system automatically reduces the sampling rate to 0.001 to reduce the computational burden, while for smaller tables the sampling rate is increased to ensure statistical accuracy.
In addition, the system introduces a hot/cold-aware sampling strategy: by analyzing access-frequency metadata (where the database supports it), it preferentially samples frequently accessed hot data, or samples cold and hot data separately and generates layered/archived statistics, thereby reflecting the data characteristics of the actual usage scenario more accurately.
At the query execution level, the system generates optimized, efficient sampling queries, making full use of random sampling functions (such as random-ID queries) or custom user functions provided by the underlying database, significantly reducing the amount of data transferred and improving overall sampling efficiency.
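A minimal sketch of the adaptive sampling rate logic described above (the minimum-sample floor of 10,000 matches the roughly 10,000-sample example later in this section but is otherwise an assumption of this sketch):
def adaptive_sampling_rate(vector_count, default_rate=0.01, min_samples=10_000):
    # very large tables (> 100 million vectors): lower the rate to cap the sampling cost
    if vector_count > 100_000_000:
        return 0.001
    # small tables: raise the rate so enough samples are drawn for accurate statistics
    if vector_count * default_rate < min_samples:
        return min(1.0, min_samples / max(vector_count, 1))
    return default_rate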
4.2 Drift metric calculation
The system provides an intelligent metric selection mechanism that can automatically select the most suitable drift metric method according to the characteristics and distribution of the vector data, and also allows the user to explicitly specify the required metric type in the SQL policy definition. NineData implements the drift metric algorithms at the middleware layer on the sampled vectors. The drift detection and metric calculation flow refers to the flows shown in FIGS. 8 and 9.
For example, COSINE_DIVERGENCE is typically used as the default metric for text embedding vectors, while WASSERSTEIN_DISTANCE is preferred for data with complex geometric characteristics, such as image feature vectors, to obtain more accurate drift detection results.
In addition, for efficient storage and policy optimization, all drift calculation results are cached and persisted in the nd_policy_run table, which makes it convenient for administrators to analyze historical drift trends and provides valuable data support for system performance tuning.
COSINE_DIVERGENCE: cosine divergence is a classical vector similarity measure that detects drift by calculating the change in direction of the center of the data distribution (mean vector). The system compares the overall distribution direction of the current data and the baseline data, can efficiently detect shifts of the distribution's center of gravity, and is particularly suitable for drift detection on large-scale datasets.
KL_DIVERGENCE: KL divergence (Kullback-Leibler divergence) is a classical statistical metric used to measure the difference between two probability distributions. In the system it is used to detect changes in the distribution shape of vector data and is particularly suitable for detecting complex multi-modal distribution changes.
WASSERSTEIN_DISTANCE: the Wasserstein distance (also known as the Earth Mover's Distance) is a distance measure from classical optimal transport theory. In the system it is used to measure geometric differences between vector data distributions and can effectively detect geometric changes such as spatial displacement and rotation of the data distribution.
4.3 Index of health status
The statistics collectors monitor and evaluate the health of the vector index in a number of ways.
When the database provides an index state API interface (e.g., get_index_state of Milvus), the collector can directly extract key metrics, including direct metrics such as degree of fragmentation and node imbalance, to provide first-hand data for index performance evaluation.
For databases that do not provide a direct API, the system may employ an indirect inference method to infer the health of the index by running a series of test queries (e.g., a random ANN search) to measure query delay and recall (recall is an indicator of the accuracy of the search results, indicating the proportion of relevant results returned to all relevant results). For example, when the query delay increases significantly from the normal 50ms to 150ms, this may be a significant signal of index fragmentation, suggesting that the system requires maintenance operations.
All collected health metrics are stored by the system in the nd_policy_run table, along with drift metric data for threshold evaluation, providing data support for automated maintenance decisions. The comprehensive evaluation mechanism ensures that the system can timely discover and respond to the condition of index performance reduction, and the query efficiency is maintained.
In practical applications, for example, when the system monitors the image_embeddings table, the collector generates and executes a sampling query such as SELECT feature_vector FROM image_embeddings USING PROBABILITY 0.001, extracting about 10,000 vector samples from the database.
The system then calculates a COSINE_DIVERGENCE value of 0.15 for these samples against the baseline and writes the result to the nd_policy_run table. At the same time, the system detects that the index query delay has reached 130 ms, exceeding a preset threshold, which triggers the corresponding maintenance operation, ensuring that system performance remains within acceptable limits.
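The indirect health-probing approach can be sketched as follows; adapter.random_stored_vector, adapter.ann_search and adapter.brute_force_search are hypothetical middleware helpers, and recall is estimated against an exact search over the same probe vectors:
import time

def probe_index_health(adapter, table, column, probes=20, top_k=10):
    latencies, recalls = [], []
    for _ in range(probes):
        query_vec = adapter.random_stored_vector(table, column)
        start = time.perf_counter()
        approx = adapter.ann_search(table, column, query_vec, top_k)
        latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds
        exact = adapter.brute_force_search(table, column, query_vec, top_k)
        recalls.append(len(set(approx) & set(exact)) / top_k)
    return {
        "p95_latency_ms": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "avg_recall": sum(recalls) / len(recalls),
    }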
5. Operation Engine (Action Engine)
The operation engine converts the policy actions into specific operations, adapting different vector databases.
5.1 Operation mapping and adaptation layer
The operation engine realizes intelligent mapping from generic actions to specific database APIs through the adaptation layer. The adaptation layer is responsible for converting abstract operation instructions into database-specific API calls, such as converting a generic REINDEX_VECTOR_COLUMN instruction seamlessly into either Milvus's create_index function or Pinecone's create_index/delete_index interface.
For example, Milvus requires parameters of the form:
{
"collection_name":"image_embeddings",
"field_name": "feature_vector",
"index_type": "HNSW",
"metric_type":"L2",
"params":{"M":32,"efConstruction":256}
}
Based on the identified database type, the adapter generates the corresponding API call sequence; for example, for a Milvus database, the system builds a JSON structure containing the collection name, field name, index type, and parameters. To ensure wide compatibility, the adaptation layer maintains detailed database version information and an API signature library and supports various mainstream vector databases including Milvus, Pinecone and Weaviate, realizing cross-database execution of operations.
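A minimal sketch of this mapping step (the per-database call names and argument layouts are illustrative placeholders drawn from the API mapping relation table, not literal Milvus or Pinecone client signatures):
def build_reindex_calls(db_type, policy):
    # map the generic REINDEX_VECTOR_COLUMN action to a database-specific call sequence
    params = policy["action"]["params"]
    if db_type == "milvus":
        return [("create_index", {
            "collection_name": policy["table"],
            "field_name": policy["action"]["column"],
            "index_type": "HNSW",
            "params": params,                 # e.g. {"M": 32, "efConstruction": 256}
        })]
    if db_type == "pinecone":
        # recorded in the mapping table as a create-then-delete sequence
        return [("create_index", {"name": policy["table"] + "_shadow", "params": params}),
                ("delete_index", {"name": policy["table"]})]
    raise ValueError(f"no API mapping registered for database type {db_type}")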
5.2 Re-embedding (Re-embedding)
In the re-embedding process, the engine first selects the data to be processed according to the policy conditions. When the drift threshold is triggered, the system prioritizes the data subsets that have drifted most severely, while also supporting full re-embedding of the entire table via an ACTION REEMBED_TABLE instruction.
The re-embedding operation is typically triggered automatically by a preset policy, but the middleware also provides an SQL interface and internal mechanisms to invoke the embedding service manually. During execution of the automated policy, the system may perform the data update through an SQL-like operation as follows.
-- re-embedding
UPDATE image_embeddings
SET feature_vector = EMBED_TEXT(source_text_column, 'text-embedding-model-v3', 768)
WHERE <condition_for_reembedding>;
In an actual automated policy, the engine interacts directly with the external AI service according to instructions such as ACTION REEMBED_DATA USING SERVICE 'external_embedding_service_v2' and obtains the new vector embeddings in batches.
In addition, to improve processing efficiency, the engine adopts a batch-calling mechanism to send the vectors to the embedding service (such as the OpenAI or Aliyun API) in batches, and can dynamically adjust the batch size according to system load, with a default of 1000 vectors per batch. All failure records are written in detail to the nd_policy_run log.
Finally, all newly generated vectors are written into the database through UPSERT operation, so that data consistency and integrity are ensured.
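A minimal sketch of the batched re-embedding loop described above (embed_batch, upsert_vectors and log_failure stand in for the external embedding-service call, the database UPSERT and the nd_policy_run logger, and are assumptions of this sketch):
def reembed_in_batches(rows, embed_batch, upsert_vectors, log_failure, batch_size=1000):
    # rows: iterable of (primary_key, source_text) pairs selected by the policy condition
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            _flush(batch, embed_batch, upsert_vectors, log_failure)
            batch = []
    if batch:
        _flush(batch, embed_batch, upsert_vectors, log_failure)

def _flush(batch, embed_batch, upsert_vectors, log_failure):
    keys = [pk for pk, _ in batch]
    texts = [text for _, text in batch]
    try:
        vectors = embed_batch(texts)              # one call to the external AI service per batch
        upsert_vectors(list(zip(keys, vectors)))  # written back via UPSERT for consistency
    except Exception as exc:
        log_failure(keys, str(exc))               # failures go to the nd_policy_run log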
5.3 Re-indexing (Re-indexing)
Shadow index rebuilding allows vector indexes to be updated and optimized without interrupting production queries.
First, the system enters a preparation phase: it decides to use shadow rebuilding according to the policy parameters (e.g., PROPS (M=32, efConstruction=256, shadow_copy=true)) and generates a unique identifier for the new index, e.g., feature_vector_index_shadow_20241109135524. Resources are then reserved: the resources required for the index rebuild are evaluated to ensure sufficient disk space and memory; for large indexes, dedicated nodes may be allocated or execution may be scheduled during off-peak periods.
In the index creation phase, the index is created through a database adapter call to the create-index API, but it is not immediately associated with the table. For example, a Milvus API call of the following style is generated:
create_index(
collection_name="image_embeddings",
field_name="feature_vector",
index_name="feature_vector_index_shadow_20241109135524",
params={"index_type":"HNSW","params":{"M":32,"efConstruction":
256}}
)
Next, the index switch is performed. For other databases, a classical shadow-index mechanism is adopted (a new index of the same kind is created alongside the existing index, the two exist in parallel, and an atomic switch is performed after verification), so that the query service is not interrupted during the switch and the switching time is controlled within 50 ms.
Finally, subsequent cleanup is carried out: after a successful switch, the system waits for queries on the old index to complete (monitoring until the number of active queries drops to zero or a timeout elapses), deletes the old index during an off-peak period, and releases the resources. The shadow-index method shortens the maintenance window from hours to minutes, eliminates user-perceived downtime, and maintains service availability and query performance in high-load environments. The zero-downtime shadow-index maintenance flow refers to the flows shown in FIGS. 4, 5, 6 and 7.
It should be noted that after the shadow index is built, the middleware sends a small portion (e.g., 1%) of the online read-only query requests to both the primary and the shadow index. The system then compares the returned results (e.g., the overlap of the Top-K lists) and calculates the recall of the shadow index relative to the primary index; only when the recall is not lower than a preset threshold (e.g., 99%) does the system perform the switch.
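The recall verification through read-traffic mirroring can be sketched as follows; search_primary and search_shadow are hypothetical middleware callables returning the Top-K ids from the primary and the shadow index respectively:
def verify_shadow_recall(mirrored_queries, search_primary, search_shadow,
                         top_k=10, recall_threshold=0.99):
    recalls = []
    for query_vec in mirrored_queries:            # e.g. ~1% of the live read-only traffic
        primary_ids = set(search_primary(query_vec, top_k))
        shadow_ids = set(search_shadow(query_vec, top_k))
        recalls.append(len(primary_ids & shadow_ids) / top_k)
    avg_recall = sum(recalls) / max(len(recalls), 1)
    # the atomic switch is performed only if the shadow index meets the recall threshold
    return avg_recall >= recall_threshold, avg_recall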
5.4 Layering/archiving
The layering/archiving functionality supports hot/cold data management based on data characteristics. In terms of hot/cold decisions, the system supports a variety of conditions, including time thresholds (e.g., an "older_than=6 months" parameter may be set), access frequency (inferred by analyzing database access logs or middleware behavior), and custom SQL conditions (e.g., "WHERE creation_date < '2024-01-01'").
On the migration mechanism, the system optimizes hot/cold layering/archiving based on the capabilities of the target database, moving vector data (and, optionally, its index) that is rarely accessed to low-cost storage (e.g., Amazon S3). If the database supports native hot/cold partition migration (such as Weaviate's move_partition API), that interface is called directly to complete the migration. For databases that do not support native migration (e.g., Milvus), the system simulates migration by: 1) querying the cold data (based on conditions such as creation_date < '2024-01-01') and copying it to low-cost storage; 2) updating the collection metadata to record the low-cost storage path of the cold data; and 3) deleting the original cold data to free high-cost storage space. To support cold-data queries, the system maintains a query routing table that dynamically redirects queries to low-cost storage or pre-loads data back to high-performance storage (if real-time ANN search is required). Migration operations are scheduled in off-peak periods, with time overhead depending on the data volume and network bandwidth (e.g., about 5-10 minutes for roughly a million vectors). Metadata updates are atomic operations, ensuring data consistency. If index migration is involved, the index file is stored with the cold data in the low-cost storage and must be loaded back to local storage when used, to restore query performance.
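A minimal sketch of the simulated migration path (adapter, cold_store and routing_table are assumed middleware components: the adapter talks to the vector database, cold_store wraps the low-cost object storage, and routing_table records where each data range now lives):
def migrate_cold_data(adapter, cold_store, routing_table, table, condition):
    # 1) query the cold rows, e.g. condition = "creation_date < '2024-01-01'"
    cold_rows = adapter.query(table, condition)
    # 2) copy them to low-cost storage and atomically record the new location
    storage_path = cold_store.put(table, cold_rows)
    routing_table.register(table, condition, storage_path)
    # 3) delete the originals to free high-cost storage
    adapter.delete(table, condition)
    return storage_path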
6. Scheduler (Scheduler)
The scheduler is responsible for managing the timed execution of policies, ensuring the robustness and efficiency of overall system operation. In terms of task management, the scheduler parses a timing expression such as SCHEDULE CRON '0 0 1 * *', calculates the next run time and stores it in an in-memory priority queue. At the same time, the scheduler sets a maximum run time limit (e.g., 1 hour) for each task, and automatically pauses the task and records error information when it times out. For tasks that fail due to API timeouts and the like, the system supports an automatic retry mechanism, retrying 3 times by default with an exponential backoff strategy.
In terms of concurrency control, the scheduler orders tasks according to the policy-defined priorities (as specified by WITH PRIORITY HIGH). To avoid database overload, the system limits the number of concurrent tasks (10 by default), and the engine can dynamically adjust concurrency based on database load metrics (e.g., CPU utilization). When multiple policies target the same table, the scheduler executes them serially, effectively preventing lock contention.
For example, the scheduler may trigger the task named monthly_reindex at 00:00 on the first day of each month. If the load of the Milvus database is high at that time, the scheduler delays executing the task and records the reason for the delay in detail in the nd_policy_run log table, ensuring transparency and traceability of system operation.
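A condensed sketch of the scheduling loop described above; it assumes the third-party croniter package for CRON parsing, treats lower priority values as more urgent, and uses run_policy as a placeholder for the actual policy execution:
import heapq
import time
from datetime import datetime
from croniter import croniter  # assumed third-party dependency for CRON parsing

def schedule_policies(policies, run_policy, max_retries=3):
    # priority queue of (next_run_time, priority, policy_name, policy)
    queue = []
    for p in policies:
        next_run = croniter(p["schedule"], datetime.now()).get_next(datetime)
        heapq.heappush(queue, (next_run, p.get("priority", 0), p["policy_name"], p))
    while queue:
        next_run, priority, name, policy = heapq.heappop(queue)
        time.sleep(max(0.0, (next_run - datetime.now()).total_seconds()))
        for attempt in range(max_retries):
            try:
                run_policy(policy)
                break
            except TimeoutError:
                time.sleep(2 ** attempt)  # exponential backoff before retrying
        # reschedule the policy for its next CRON occurrence
        following = croniter(policy["schedule"], next_run).get_next(datetime)
        heapq.heappush(queue, (following, priority, name, policy))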
7. Auditing and security
The system provides an audit function: all policy modifications are recorded in the nd_policies_history table, keeping historical versions and change records. Meanwhile, every execution of a policy, including detailed information on success and failure, is recorded in the nd_policy_run table; these log records support enterprise compliance requirements and facilitate tracing and troubleshooting.
In terms of security, the system implements an access control mechanism. A user must possess ADMIN rights to create or modify a vector lifecycle policy; this control is enforced by the database's role-based access control (RBAC) system. In addition, all SQL statements are parsed with parameterization before execution to prevent SQL injection attacks. Incoming JSON undergoes complete schema validation and escaping before being mapped to API calls, avoiding the risk of secondary injection where string concatenation inside the JSON would be passed through to the underlying REST calls.
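As a small illustration of the parameterized handling (the nd_policy_run column names used here are assumptions of this sketch; section 3.1 only lists the table's contents in general terms):
import json
import sqlite3

def log_policy_run(conn, policy_id, drift_value, action, status, metrics):
    # values are bound as parameters, never concatenated into the SQL string
    conn.execute(
        "INSERT INTO nd_policy_run (policy_id, drift_value, action, status, metrics)"
        " VALUES (?, ?, ?, ?, ?)",
        (policy_id, drift_value, action, status, json.dumps(metrics)),
    )
    conn.commit()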
8. Alternative embodiment
The present system supports a variety of flexible embodiments. In terms of scheduling, in addition to the built-in scheduler, the system may be integrated with external scheduling systems such as Kubernetes CronJob or Apache Airflow, in which case the SQL statements are used only for policy declaration and execution scheduling is handled by these specialized tools.
For multi-tenant scenarios, the system designs a catalog table structure isolated by tenant ID, fully supporting SaaS-mode deployment and ensuring mutual isolation of data and policies between different tenants.
In terms of drift metrics, the system not only provides built-in metric methods but also supports extending custom metrics through user-defined functions (UDFs) to meet the needs of specific business scenarios.
The system also supports federated query capability, allowing policies to be enforced jointly across different vector databases and relational databases; for example, Milvus and MySQL databases may be managed jointly, implementing more complex data management policies.
The core innovation points comprise:
1. Declarative SQL extensions:
standard SQL is extended with new commands such as CREATE VECTOR_LIFECYCLE_POLICY.
The user declares the management policy in simple SQL, for example:
"detect product vector drift exceeding 10%, automatically optimize the index, and ensure zero downtime using a shadow index";
"periodically check index health and automatically rebuild inefficient indexes";
"move data that has not been accessed for more than 6 months to low-cost storage (hot/cold layering/archiving)".
2. Intelligent adaptation:
the middleware automatically translates the general SQL strategy into the native instructions of different databases, and users do not need to care about the underlying differences.
3. Non-invasive:
the vector database kernel is not required to be modified, and the new database or the existing system is easy to adapt.
Major points of value and innovation:
Lowered threshold: developers manage complex vector databases with familiar SQL, without learning proprietary APIs.
Standardization: a unified, cross-platform management interface compatible with various vector databases.
Auditability: operation records meet compliance requirements.
Declarative advantage: the user only needs to define the target state and does not need to write complex scripts.
Middleware value: the application and the database are decoupled, providing flexibility and extensibility.
Differences between the invention and the prior art
Not merely SQL querying: the prior art may support querying vector databases with SQL, but the present invention uses SQL to define and automate the overall lifecycle management policy.
Technical details of vector drift metric method:
Note that these metric methods are not themselves the core innovation of the present invention; they are integrated as prior art into the declarative SQL middleware system to implement the automated vector-data drift detection function. Where the following description is not exhaustive, reference may be made to the related art.
A.1 Cosine divergence (COSINE_DIVERGENCE)
Mathematical definition:
COSINE_DIVERGENCE(μ_b, μ_c) = 1 - cos(θ) = 1 - (μ_b · μ_c) / (||μ_b|| × ||μ_c||);
where μ_b denotes the mean vector of the baseline data, μ_c denotes the mean vector of the current data, and cos(θ) is the cosine similarity between the two mean vectors.
The calculation flow is as follows:
1. obtain the mean vector μ_b from the baseline snapshot;
2. calculate the mean vector of the current sampled data, μ_c = (1/n) Σ x_i;
3. calculate the cosine divergence value;
4. compare the divergence value with the preset threshold to judge whether drift has occurred.
Code example:
python
import numpy as np
baseline_mean = get_baseline_mean_vector(snapshot_id)  # mean vector stored in the baseline snapshot
current_mean = np.mean(current_samples, axis=0)  # mean vector of the current sample
cosine_sim = np.dot(baseline_mean, current_mean) / (np.linalg.norm(baseline_mean) * np.linalg.norm(current_mean))
divergence = 1 - cosine_sim
A.2 KL divergence (KL_DIVERGENCE)
Mathematical definition:
KL(P||Q)=∫p(x)log(p(x)/q(x))dx;
For high-dimensional vectors, the approximation is calculated by dimension reduction or distribution fitting:
KL_approx(P||Q) = Σ_i p(x_i) log(p(x_i)/q(x_i));
The calculation steps are as follows:
1. bucket the baseline and the current vector data (e.g., PCA dimensionality reduction followed by bucketing);
2. estimate the probability densities p(x_i) and q(x_i) of each bucket;
3. calculate the discretized KL divergence value;
4. handle zero-probability cases using smoothing techniques.
Applicable scenarios:
detecting changes in multi-modal distributions;
detecting significant changes in distribution shape;
medium-scale vector sets (to avoid the curse of dimensionality).
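Code example (a minimal sketch of the bucketed KL approximation; the one-dimensional projection onto the first principal component and the smoothing constant are simplifying assumptions of this sketch):
python
import numpy as np

def kl_divergence_bucketed(baseline, current, num_buckets=50, eps=1e-9):
    # project both sets onto the first principal component of the baseline data
    centered = baseline - baseline.mean(axis=0)
    direction = np.linalg.svd(centered, full_matrices=False)[2][0]
    b_proj = baseline @ direction
    c_proj = current @ direction
    # bucket both projections on a common grid
    edges = np.histogram_bin_edges(np.concatenate([b_proj, c_proj]), bins=num_buckets)
    p, _ = np.histogram(b_proj, bins=edges)
    q, _ = np.histogram(c_proj, bins=edges)
    # normalize to probabilities and smooth zero-probability buckets
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))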
A.3 Wasserstein distance (WASSERSTEIN_DISTANCE)
Mathematical definition (1-Wasserstein distance):
W_1(P,Q) = inf_{γ∈Γ(P,Q)} ∫ ||x-y|| dγ(x,y);
For high-dimensional vectors, the approximate calculation adopts the Sliced Wasserstein Distance:
SWD(P,Q) = ∫_{S^{d-1}} W_1(P_θ, Q_θ) dθ;
The implementation algorithm comprises the following steps:
1. randomly selecting a plurality of projection directions theta;
2. Projecting the vector into a one-dimensional space;
3. Calculating 1-dimensional Wasserstein distances of the distribution after projection;
4. All projection directions are averaged.
Code example:
python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein_distance(X, Y, num_projections=100):
    dim = X.shape[1]
    distances = []
    for _ in range(num_projections):
        # randomly generate a unit projection direction
        direction = np.random.randn(dim)
        direction /= np.linalg.norm(direction)
        # project both vector sets onto the direction
        X_proj = X @ direction
        Y_proj = Y @ direction
        # compute the 1-D Wasserstein distance of the projected distributions
        distances.append(wasserstein_distance(X_proj, Y_proj))
    return np.mean(distances)
A.4 Selection guidelines for the metrics
COSINE_DIVERGENCE is applicable for:
large-scale datasets (> 10^6 vectors);
cases where the main concern is a shift of the overall distribution center;
scenarios with constrained computing resources;
KL_DIVERGENCE is applicable for:
medium-scale datasets (10^4 - 10^6 vectors);
cases where changes in distribution shape need to be detected;
multi-modal data distributions;
WASSERSTEIN_DISTANCE is applicable for:
cases focusing on spatial geometric changes;
detecting translation and rotation of the distribution;
scenarios requiring a certain robustness to noise.
A.5 threshold setting suggestions
Based on experimental experience, suggested default thresholds are:
COSINE_DIVERGENCE: 0.05-0.15 (depending on the application scenario);
KL_DIVERGENCE: 0.1-0.5 (to be adjusted according to data characteristics);
WASSERSTEIN_DISTANCE: 10%-30% of the data standard deviation.