Data Loss Bug After Upgrading Milvus 2.5 To 2.6


This article addresses a critical bug encountered after upgrading Milvus from version 2.5 to 2.6, specifically focusing on data loss issues within the streaming component. This issue was identified during a test scenario involving concurrent data operations, highlighting the importance of robust data handling during upgrades. We will delve into the environment setup, the observed behavior, steps to reproduce the issue, and relevant logs to provide a comprehensive understanding of the problem.

Environment Configuration

To provide context, the Milvus environment was configured as follows:

  • Milvus Version: Upgraded from 2.5-20250707-bdfc740a-amd64 to master-20250708-d41eec6f-amd64
  • Deployment Mode: Not specified (Standalone or Cluster)
  • MQ Type: Not specified (RocksMQ, Pulsar, or Kafka)
  • SDK Version: Not specified (e.g., pymilvus v2.0.0rc2)
  • Operating System: Not specified (Ubuntu or CentOS)
  • CPU/Memory: Not specified
  • GPU: Not specified
  • Other: Not specified

Server Configuration

The server configuration included several key settings:

common:
  enabledJsonKeyStats: true
  enabledOptimizeExpr: false
  storage:
    enablev2: false
dataCoord:
  enableActiveStandby: true
  enabledJSONKeyStatsInSort: false
indexCoord:
  enableActiveStandby: true
log:
  level: debug
queryCoord:
  enableActiveStandby: true
rootCoord:
  enableActiveStandby: true

These settings show that JSON key statistics were enabled, optimized expression evaluation and storage v2 were disabled, active-standby mode was enabled for the coordinator services, and the log level was set to debug. Knowing these settings matters when diagnosing the root cause of the data loss.

Current Behavior: Identifying Data Loss

The primary issue observed was data loss after upgrading from Milvus version 2.5 to 2.6. This was detected through birdwatcher binlog analysis, which revealed discrepancies in the data. The data loss occurred during a scenario involving concurrent operations, specifically after creating a collection, inserting data, flushing, indexing, and loading the data.

Client-Side Operations

The client-side tests involved the following steps:

  1. Creating a Collection: A collection was created with the following schema:

    {
     'auto_id': False,
     'description': '',
     'fields': [
      {'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False},
      {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}},
      {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>},
      {'name': 'array_varchar_1', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 10, 'max_capacity': 10}, 'element_type': <DataType.VARCHAR: 21>}
     ],
     'enable_dynamic_field': False
    } (base.py:329)
    

    This collection includes a primary key (id), a 128-dimensional float vector field, a JSON field, and an array of VARCHAR values, with enable_dynamic_field set to False. The complexity of the schema suggests that the data loss could be related to how complex data types are handled during the upgrade.

  2. Data Ingestion and Indexing: The test created an index, inserted 30 million rows, flushed the data to storage, created the index again, and finally loaded the collection. This is the typical sequence for preparing data for search and query in Milvus; a minimal pymilvus sketch of this setup appears below.

  3. Concurrent Requests: Concurrent requests were made, including upserts, flushes, queries, and searches. This concurrent workload was designed to simulate a production environment and expose potential race conditions or data inconsistencies.

The data loss was evident from the birdwatcher binlog analysis, indicating that some data was not persisted correctly during these operations.
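
The following pymilvus sketch mirrors the client-side setup above on a much smaller scale. Only the schema follows the one reported in the issue; the connection endpoint, collection name, index parameters, and batch size are illustrative assumptions rather than the actual test harness.

```python
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Schema matching the reported one: INT64 primary key, 128-dim float vector,
# a JSON field, and an array of VARCHARs.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    FieldSchema("json_1", DataType.JSON),
    FieldSchema("array_varchar_1", DataType.ARRAY,
                element_type=DataType.VARCHAR, max_capacity=10, max_length=10),
]
schema = CollectionSchema(fields, enable_dynamic_field=False)
collection = Collection("upgrade_test", schema)  # hypothetical collection name

# Index parameters are placeholders; the report does not say which index was used.
collection.create_index(
    "float_vector",
    {"index_type": "HNSW", "metric_type": "L2",
     "params": {"M": 8, "efConstruction": 64}},
)

# Insert a small batch (the actual test inserted 30 million rows), then flush and load.
n = 1000
collection.insert([
    list(range(n)),                                             # id
    [[random.random() for _ in range(128)] for _ in range(n)],  # float_vector
    [{"k": i} for i in range(n)],                               # json_1
    [[f"v{i % 10}"] for i in range(n)],                         # array_varchar_1
])
collection.flush()
collection.load()
```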

Visual Evidence of Data Loss

An image from birdwatcher binlog clearly shows the data loss issue:

[birdwatcher binlog screenshot showing the missing data]

This visual evidence underscores the severity of the problem and highlights the need for a thorough investigation.

Expected Behavior: Ensuring Data Integrity

The expected behavior after an upgrade is that all data should be preserved and accessible. Data integrity is paramount, and any data loss is a critical issue that needs immediate attention. The goal is to ensure that the upgrade process is seamless and does not compromise data reliability.
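
One simple way to check that expectation is to compare the number of rows written before the upgrade against what the server reports afterwards. This is a minimal sketch, assuming a collection named upgrade_test, a known expected row count, and a pymilvus release recent enough to support the count(*) query; it is not the verification method used in the original test.

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # assumed endpoint
collection = Collection("upgrade_test")               # hypothetical name
collection.load()

expected_rows = 30_000_000  # rows inserted before the upgrade, per the report

# count(*) reflects the rows that are actually queryable on the server side.
res = collection.query(expr="", output_fields=["count(*)"])
actual_rows = res[0]["count(*)"]

if actual_rows != expected_rows:
    print(f"Possible data loss: expected {expected_rows}, got {actual_rows}")
```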

Steps to Reproduce: A Guide for Replicating the Issue

To reproduce this issue, the following steps can be followed:

  1. Set up a Milvus cluster running version 2.5.
  2. Create a collection with the schema described above, including a primary key, float vector, JSON field, and array of VARCHARs.
  3. Insert 30 million data entries into the collection.
  4. Flush the data to storage.
  5. Create an index on the collection.
  6. Upgrade the Milvus cluster to version 2.6.
  7. After the upgrade, initiate concurrent requests including upserts, flushes, queries, and searches (a sketch of such a workload appears after these steps).
  8. Monitor the data using birdwatcher binlog or similar tools to check for data loss.

By following these steps, it should be possible to replicate the data loss issue and verify any fixes implemented.
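
The concurrent phase in step 7 can be approximated with a simple thread pool that issues upserts, flushes, queries, and searches against the loaded collection. This is a rough sketch under assumed names and parameters (collection name, worker count, batch size, search params), not the actual test harness.

```python
import random
from concurrent.futures import ThreadPoolExecutor

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # assumed endpoint
collection = Collection("upgrade_test")               # hypothetical name

def upsert_batch(offset: int) -> None:
    ids = list(range(offset, offset + 100))
    collection.upsert([
        ids,
        [[random.random() for _ in range(128)] for _ in ids],  # float_vector
        [{"k": i} for i in ids],                                # json_1
        [[f"v{i % 10}"] for i in ids],                          # array_varchar_1
    ])

def search_once() -> None:
    collection.search(
        data=[[random.random() for _ in range(128)]],
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {"ef": 64}},
        limit=10,
    )

def query_once(offset: int) -> None:
    collection.query(expr=f"id >= {offset} and id < {offset + 10}",
                     output_fields=["id"])

# Mix upserts, searches, queries, and occasional flushes from multiple threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(100):
        pool.submit(upsert_batch, i * 100)
        pool.submit(search_once)
        pool.submit(query_once, i * 100)
        if i % 20 == 0:
            pool.submit(collection.flush)
```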

Milvus Logs: A Deep Dive into System Behavior

Milvus logs are crucial for diagnosing issues. The provided information includes a link to an Argo workflow and pod status information after the upgrade.

```
zong-roll-storage-ups-2-milvus-datanode-98bcbbbbc-74gcl        1/1   Running   0   17h   10.104.19.66    4am-node28
zong-roll-storage-ups-2-milvus-datanode-98bcbbbbc-ssm2r        1/1   Running   0   17h   10.104.27.159   4am-node31
zong-roll-storage-ups-2-milvus-datanode-98bcbbbbc-xx4rd        1/1   Running   0   17h   10.104.33.22    4am-node36
zong-roll-storage-ups-2-milvus-mixcoord-6b89c4459c-mbmw6       1/1   Running   0   17h   10.104.6.42     4am-node13
zong-roll-storage-ups-2-milvus-proxy-6978447969-zckqn          1/1   Running   0   17h   10.104.27.161   4am-node31
zong-roll-storage-ups-2-milvus-querynode-1-6767b99d5b-fn7tn    1/1   Running   0   17h   10.104.15.239   4am-node20
zong-roll-storage-ups-2-milvus-querynode-1-6767b99d5b-tpzwt    1/1   Running   0   17h   10.104.6.43     4am-node13
zong-roll-storage-ups-2-milvus-streamingnode-65c9788f5-66psv   1/1   Running   0   17h   10.104.9.212    4am-node14
zong-roll-storage-ups-2-milvus-streamingnode-65c9788f5-g5bls   1/1   Running   0   17h   10.104.34.183   4am-node37
```

All pods are in the `Running` state, which indicates that the services are operational. However, this does not rule out underlying issues causing data loss. Detailed logs from these pods, especially the **streaming nodes**, are essential for further analysis. The streaming nodes are particularly relevant because the reported data loss appears to be related to the streaming component.

Further Investigation and Potential Causes

To resolve this issue, a detailed investigation is necessary, focusing on the following areas:

  1. Streaming Node Logs: Examine the logs from the streaming nodes for any error messages, warnings, or anomalies during the upgrade and concurrent operations. Look for indications of write failures, data inconsistencies, or synchronization problems.
  2. Binlog Analysis: Perform a thorough analysis of the binlogs to identify the exact point of data loss and the operations involved. This can help pinpoint whether the issue occurs during data insertion, flushing, or indexing.
  3. Concurrency Handling: Review the concurrency handling mechanisms in Milvus, particularly in the streaming component, to identify potential race conditions or deadlocks that could lead to data loss.
  4. Upgrade Process: Investigate the upgrade process itself to ensure that all data migrations and schema updates are handled correctly. Look for any compatibility issues between versions 2.5 and 2.6 that might affect data integrity.
  5. Data Type Handling: Given that the collection schema includes complex data types such as JSON and arrays, verify that these types are handled correctly during the upgrade and under concurrent operations; serialization, deserialization, or storage of these types could be at fault. A simple round-trip check is sketched below.
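
For point 5, a round-trip check can at least confirm that JSON and VARCHAR-array values written after the upgrade are stored and returned intact. This is a hedged sketch under the same assumptions as the earlier examples (collection name, probe id, probe values); it says nothing about rows written before the upgrade.

```python
import random

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # assumed endpoint
collection = Collection("upgrade_test")               # hypothetical name
collection.load()

probe_id = 999_999_999  # an id assumed to be outside the normal data range
expected_json = {"nested": {"a": 1}, "tags": ["x", "y"]}
expected_array = ["alpha", "beta", "gamma"]  # each value within max_length=10

# Write one row with known complex values, flush, then read it back and compare.
collection.upsert([
    [probe_id],
    [[random.random() for _ in range(128)]],
    [expected_json],
    [expected_array],
])
collection.flush()

rows = collection.query(expr=f"id == {probe_id}",
                        output_fields=["json_1", "array_varchar_1"])
assert rows and rows[0]["json_1"] == expected_json, "JSON field did not round-trip"
assert rows[0]["array_varchar_1"] == expected_array, "ARRAY field did not round-trip"
```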

Conclusion: Addressing Data Loss in Milvus Upgrades

Data loss after an upgrade is a significant issue that requires a systematic approach to diagnose and resolve. By examining the streaming node logs, analyzing the binlogs, and reviewing concurrency handling and the upgrade process, it should be possible to identify the root cause. The information presented here provides a starting point for that investigation and for verifying that data integrity is preserved during Milvus upgrades. Addressing this bug is crucial for maintaining user trust and the reliability of the Milvus platform.

Thorough testing and monitoring during and after upgrades remain essential to catch regressions like this early, and continued improvement of the upgrade process and data handling mechanisms will further strengthen the robustness of Milvus.