
[Bug]: Very high p99 with low requests and sufficient resources #34476

Open
1 task done
syang1997 opened this issue Jul 8, 2024 · 40 comments
Labels: help wanted (Extra attention is needed)

Comments

@syang1997

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.3.15
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):  Java sdk v2.3.4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: query node 8c 16g
- GPU: 
- Others:

Current Behavior

While request traffic was steady, p99 latency suddenly spiked to about 15,000 ms,
even though resources were sufficient and CPU and memory usage were low. What could be the reason?

The following are the monitoring screenshots:

[monitoring screenshots]

The following is the querynode log:

[newVersion=1720315568890138213] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:18.893 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315558893267636] [newVersion=1720315568890093833] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.887 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=5f972dd954fb612893c39b167c22d19c] [collectionID=449980176319159710] [channel=milvus-c-customer-service-rootcoord-dml_4_449980176319159710v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895106277]
[2024/07/07 01:26:28.887 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889907859] [newVersion=1720315578895106277] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=f99fdc60cdb2dfa46725f861cad74639] [collectionID=449980176295686266] [channel=milvus-c-customer-service-rootcoord-dml_10_449980176295686266v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895383796]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=37a70ecd2a87c776cf6b344bec75bfc4] [collectionID=449980176309032169] [channel=milvus-c-customer-service-rootcoord-dml_15_449980176309032169v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894055581]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890093833] [newVersion=1720315578894055581] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.888 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890025022] [newVersion=1720315578895383796] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.889 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=70147845a02f2277c5442f83e17b61f5] [collectionID=449980176317610505] [channel=milvus-c-customer-service-rootcoord-dml_3_449980176317610505v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894246706]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889805018] [newVersion=1720315578894246706] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=a588afc25adcb0b5d29fb43e1eaeb6a3] [collectionID=449980176308582545] [channel=milvus-c-customer-service-rootcoord-dml_13_449980176308582545v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895006088]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890138213] [newVersion=1720315578895006088] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=dd7f1cc82b6dcd01add1589bb330dec3] [collectionID=449980176316006753] [channel=milvus-c-customer-service-rootcoord-dml_1_449980176316006753v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895190401]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889576391] [newVersion=1720315578895190401] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=6a896409a6ffd785eac14e3e97cdd30b] [collectionID=450884966347940353] [channel=milvus-c-customer-service-rootcoord-dml_2_450884966347940353v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895063008]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889949859] [newVersion=1720315578895063008] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=798958dbfb9c98e936505940cacb1e67] [collectionID=449980176314540038] [channel=milvus-c-customer-service-rootcoord-dml_0_449980176314540038v0] [currentNodeID=39] 

Expected Behavior

No response

Steps To Reproduce

The problem occurs occasionally and cannot be reliably reproduced.

Milvus Log

No response

Anything else?

No response

@syang1997 added the kind/bug (Issues or changes related a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Jul 8, 2024
@yanliang567
Contributor

quick questions:

  1. are you running all the query nodes on exclusive underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

@yanliang567 added the triage/needs-information (Indicates an issue needs more information in order to work on it.) label and removed the needs-triage label on Jul 8, 2024
@yanliang567 added this to the 2.4.6 milestone on Jul 8, 2024
@syang1997
Author

quick questions:

  1. are you running all the query nodes on exclusive underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

  1. The k8s cluster is dedicated to Milvus, and the underlying node hosting this querynode shows no fluctuation in CPU usage; resources are available.
  2. The collection uses an in-memory index, and no writes or other operations were performed during the abnormal period. No obvious network anomalies were found.
  3. There are too many logs across all the components. I checked the querynode logs and found no valuable information. Tell me which component logs you need and I can extract the logs for that time period and upload them.
@syang1997
Author

Partial node monitoring metrics @yanliang567

[node monitoring screenshots]

@yanliang567
Contributor

/assign @congqixia
please help to take a look

@yanliang567 added the help wanted (Extra attention is needed) label and removed the kind/bug and triage/needs-information labels on Jul 8, 2024
@syang1997
Author

@yanliang567 Who can help investigate? It recently appeared again.

@yanliang567
Contributor

No clues yet. We need more info from the logs to know what was happening at that moment. Please provide the full Milvus pod logs for that period.

@yanliang567 modified the milestones: 2.4.6, 2.4.7 on Jul 19, 2024
@syang1997
Author

milvus-log (3).tar.gz
@yanliang567 I captured the latest debug logs for a p99 timeout.
Filtering the logs, I found entries in the proxy log showing a long wait for the search result, as follows:

Line 13498: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/task_search.go:436] ["tr/proxy execute search 451345685543387137"] [traceID=02c29e210100a6076c41537ce36985cb] [msg=done] [duration=3.780114072s]
Line 13511: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/impl.go:2634] [tr/Search] [traceID=02c29e210100a6076c41537ce36985cb] [msg="wait search result"] [duration=3.780748241s]

But I did not find a similarly long query time in the QueryNode logs. I initially suspected an MQ problem, but the monitoring shows Pulsar's resource usage was very low and its nodes were not abnormal at the time.
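
For readers trying to do the same correlation, here is a minimal sketch (not part of the original report) for pulling every line with a given traceID out of the unpacked log bundle; the directory layout, file extension, and default traceID are assumptions taken from the lines quoted above:

import glob
import sys

# traceID of the slow search, taken from the proxy log lines above
trace_id = sys.argv[1] if len(sys.argv) > 1 else "02c29e210100a6076c41537ce36985cb"

# assumed layout: the unpacked milvus-log bundle with one plain-text .log file per pod
for path in sorted(glob.glob("milvus-log/**/*.log", recursive=True)):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if trace_id in line:
                print(f"{path}:{lineno}: {line.rstrip()}")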

@syang1997
Author

@yanliang567 Can you help me analyze the cause of the timeout?

@yanliang567
Contributor

okay, let me check the logs

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

Hello @syang1997, could you please provide us with the monitoring for wait tsafe latency?

@syang1997
Author

Hello @syang1997, could you please provide us with the monitoring for wait tsafe latency?

[monitoring screenshots]

@yanliang567
Contributor

Additionally, please attach the metric screenshots around 2024/07/23 15:05 (±1h); it would help us address the issue. @syang1997

@syang1997
Author

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.

@syang1997
Author

Additionally, please attach the metric screenshots around 2024/07/23 15:05 (±1h); it would help us address the issue. @syang1997

Is there any way to export only a specified time period? Running the log export script takes a long time, and a full 24-hour log is too large.
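
As a workaround for the export-size problem, a small sketch like the following could trim an already-collected pod log down to the ±1h window; the input/output file names and the timestamp format are assumptions based on the log lines quoted in this thread:

import re
from datetime import datetime, timedelta

# timestamp format assumed from the Milvus log lines above, e.g. [2024/07/23 15:05:56.323 +00:00]
TS_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

center = datetime(2024, 7, 23, 15, 5)                      # moment of the slow search
lo, hi = center - timedelta(hours=1), center + timedelta(hours=1)

with open("proxy.log", errors="replace") as src, open("proxy-window.log", "w") as dst:
    for line in src:
        m = TS_RE.search(line)
        if m and lo <= datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S") <= hi:
            dst.write(line)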

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.

@syang1997 The wait tsafe latency monitoring looks like this:

[screenshot of the wait tsafe latency panel]

@syang1997
Author

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.
@syang1997 The wait tsafe latency monitoring looks like this:

[screenshot of the wait tsafe latency panel]

I didn't find this panel. Which version of the Grafana dashboard are you using?

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

I didn't find this panel. Which version of the Grafana dashboard are you using?

That's OK; if the querynode search request latency is low, then the wait tsafe latency is likely to be low as well.

@syang1997
Author

@bigsheeper @yanliang567 What more information do you need from me?

@congqixia
Contributor

@syang1997 quick question: did you detect any network issues while this problem was happening?

@syang1997
Author

quick question: did you detect any network issues while this problem was happening?

Judging from the node network monitoring, there is no problem with the network.

@xiaofan-luan
Contributor

What about your host machine CPU?
Sometimes a burst on another pod can also affect your performance.

@xiaofan-luan
Contributor

You may also check the Go garbage collection and K8s throttling info.
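
For what it's worth, here is a minimal sketch of the CPU-throttling check suggested above, run from inside a pod. The paths and field names come from the cgroup v1/v2 documentation, not from Milvus; this is only an illustration. (Go GC pauses can similarly be surfaced with the runtime's GODEBUG=gctrace=1 setting, though the Milvus dashboards already chart GC metrics.)

from pathlib import Path

# cgroup v2 exposes cpu.stat at the cgroup root; cgroup v1 keeps it under the cpu controller
for candidate in (Path("/sys/fs/cgroup/cpu.stat"), Path("/sys/fs/cgroup/cpu/cpu.stat")):
    if candidate.exists():
        stats = dict(line.split() for line in candidate.read_text().splitlines() if line.strip())
        print(f"{candidate}:")
        print(f"  nr_periods   = {stats.get('nr_periods')}")
        print(f"  nr_throttled = {stats.get('nr_throttled')}")  # > 0 means the CPU limit kicked in
        print(f"  throttled    = {stats.get('throttled_usec', stats.get('throttled_time'))}")
        break
else:
    print("no cpu.stat found; is this running inside a cgroup-managed container?")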

@syang1997
Author

What about your host machine CPU? Sometimes a burst on another pod can also affect your performance.

The node metrics are normal: CPU utilization is very low and network bandwidth usage is not high. I also uploaded the node monitoring information earlier.

@syang1997
Author

You may also check the Go garbage collection and K8s throttling info.

I checked the GC status of the components in the Milvus monitoring and found no problems. As for k8s resources, nothing is overcommitted or throttled.

@syang1997
Author

@yanliang567 My preliminary judgment is that the problem lies with Pulsar, but Pulsar's CPU and memory usage are very low. Could it be caused by Pulsar's PVC being exhausted and cleaned up? I haven't checked Pulsar's PVC usage yet.

@syang1997
Author

Hello @syang1997 , it seems your monitoring dashboard version is quite outdated. Could you please update your dashboard by importing this configuration file? (https://github.com/milvus-io/milvus/blob/master/deployments/monitor/grafana/milvus-dashboard.json) Then, provide us with the updated monitoring information to help us further investigate the issue. Thank you very much.

Okay, I have deployed the latest dashboard. What metric information do you need?

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

Okay, I have deployed the latest dashboard. What metric information do you need?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

@syang1997
Author

Okay, I have deployed the latest dashboard. What metric information do you need?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

Okay, please do so as soon as possible, because the monitoring data is retained for only 7 days.

sre-ci-robot pushed a commit that referenced this issue Jul 25, 2024
/kind improvement
cc @yanliang567 
issue: #34476

Signed-off-by: Edward Zeng <jie.zeng@zilliz.com>
@syang1997
Author

@bigsheeper I upgraded Grafana. What monitoring metrics do I need to provide?

@bigsheeper
Contributor

@bigsheeper I upgraded Grafana. What monitoring metrics do I need to provide?

@syang1997 It would be best to have all the dashboards for the proxy and querynode.

@syang1997
Author

[proxy and querynode dashboard screenshots]

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

@syang1997 The monitoring looks fine. I still think it might be a network issue between the proxy and the querynode; I mean stability, not throughput.

@bigsheeper
Contributor

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:408] ["call received grpc error"] [clientRole=querynode-53] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [retry/retry.go:102] ["retry func failed"] [retried=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:554] ["ClientBase Call grpc call get error"] [role=querynode-53] [address=10.37.113.183:21123] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n/usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102\n  | github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n  | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 
github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | /go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

Please check if there are frequent connection timeout logs with querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997
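
A rough way to run that check, sketched here for illustration (the file name and the role/clientRole field layout are assumptions taken from the warning snippets above), is to tally DeadlineExceeded warnings per querynode in the proxy log:

import re
from collections import Counter

# field layout assumed from the proxy warnings above, e.g. [clientRole=querynode-53] or [role=querynode-53]
ROLE_RE = re.compile(r"\[(?:role|clientRole)=(querynode-\d+)\]")
timeouts = Counter()

with open("proxy.log", errors="replace") as f:
    for line in f:
        if "DeadlineExceeded" in line:
            m = ROLE_RE.search(line)
            if m:
                timeouts[m.group(1)] += 1

for node, count in timeouts.most_common():
    print(f"{node}: {count} timeout warnings")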

@syang1997
Author

[architecture diagram screenshots]

I think in the 2.4 architecture the Proxy talks to the QueryNode directly, while in 2.3 it goes through the MQ. My Milvus version is 2.3.15. @bigsheeper

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

@syang1997
Author

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

@syang1997
Author

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

[warning logs quoted above]

Please check if there are frequent connection timeout logs with querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in milvus, search requests are sent directly from proxy to querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997
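
To watch the proxy-to-querynode network path itself, a crude probe like the one below can be left running from the proxy host to see whether connect times occasionally spike. This is a sketch only: the address is the one from the warning logs, and it measures raw TCP connect latency to the gRPC port, not Milvus behaviour.

import socket
import statistics
import time

HOST, PORT = "10.37.113.183", 21123   # querynode-53 address from the warning logs above

samples = []
for _ in range(20):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=3):
            samples.append((time.monotonic() - start) * 1000)
    except OSError as exc:
        print(f"connect failed: {exc}")
    time.sleep(0.5)

if samples:
    print(f"connects={len(samples)}  min={min(samples):.1f} ms  "
          f"p50={statistics.median(samples):.1f} ms  max={max(samples):.1f} ms")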

@syang1997
Author

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in milvus, search requests are sent directly from proxy to querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997

Okay, I will continue investigating in this direction.

@syang1997
Author

@bigsheeper The query link in the 2.3 architecture diagram is easy to misunderstand. Could you consider clarifying it?
