
[Bug]: Very high p99 with low requests and sufficient resources #34476

Open
1 task done
syang1997 opened this issue Jul 8, 2024 · 40 comments
Labels: help wanted (Extra attention is needed)

Comments

@syang1997

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.3.15
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):  Java sdk v2.3.4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: query node 8c 16g
- GPU: 
- Others:

Current Behavior

While request traffic was steady, p99 latency suddenly spiked to about 15,000 ms,
even though resources were sufficient and CPU and memory usage were low. What could be the reason?

The following are the monitoring screenshots:

[monitoring screenshots]

The following is the querynode log:

[newVersion=1720315568890138213] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:18.893 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315558893267636] [newVersion=1720315568890093833] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.887 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=5f972dd954fb612893c39b167c22d19c] [collectionID=449980176319159710] [channel=milvus-c-customer-service-rootcoord-dml_4_449980176319159710v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895106277]
[2024/07/07 01:26:28.887 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889907859] [newVersion=1720315578895106277] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=f99fdc60cdb2dfa46725f861cad74639] [collectionID=449980176295686266] [channel=milvus-c-customer-service-rootcoord-dml_10_449980176295686266v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895383796]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=37a70ecd2a87c776cf6b344bec75bfc4] [collectionID=449980176309032169] [channel=milvus-c-customer-service-rootcoord-dml_15_449980176309032169v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894055581]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890093833] [newVersion=1720315578894055581] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.888 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890025022] [newVersion=1720315578895383796] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.889 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=70147845a02f2277c5442f83e17b61f5] [collectionID=449980176317610505] [channel=milvus-c-customer-service-rootcoord-dml_3_449980176317610505v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894246706]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889805018] [newVersion=1720315578894246706] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=a588afc25adcb0b5d29fb43e1eaeb6a3] [collectionID=449980176308582545] [channel=milvus-c-customer-service-rootcoord-dml_13_449980176308582545v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895006088]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890138213] [newVersion=1720315578895006088] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=dd7f1cc82b6dcd01add1589bb330dec3] [collectionID=449980176316006753] [channel=milvus-c-customer-service-rootcoord-dml_1_449980176316006753v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895190401]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889576391] [newVersion=1720315578895190401] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=6a896409a6ffd785eac14e3e97cdd30b] [collectionID=450884966347940353] [channel=milvus-c-customer-service-rootcoord-dml_2_450884966347940353v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895063008]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889949859] [newVersion=1720315578895063008] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=798958dbfb9c98e936505940cacb1e67] [collectionID=449980176314540038] [channel=milvus-c-customer-service-rootcoord-dml_0_449980176314540038v0] [currentNodeID=39] 

Expected Behavior

No response

Steps To Reproduce

The problem occurs occasionally and cannot be reliably reproduced.

Milvus Log

No response

Anything else?

No response

@syang1997 added the kind/bug (Issues or changes related a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Jul 8, 2024
@yanliang567
Contributor

quick questions:

  1. are you running all the query nodes on exclusive underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

@yanliang567 added the triage/needs-information (Indicates an issue needs more information in order to work on it.) label and removed the needs-triage label on Jul 8, 2024
@yanliang567 added this to the 2.4.6 milestone on Jul 8, 2024
@syang1997
Author

quick questions:

  1. are you running all the query nodes on exclusive underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

  1. The k8s cluster is dedicated to Milvus, and the underlying node hosting this querynode shows no fluctuation in CPU usage; resources are available.
  2. The collection uses an in-memory index, and no writes or other operations were performed during the abnormal period. No obvious network anomalies were found.
  3. There are too many logs across all the components. I checked the querynode logs and found no valuable information. Tell me which component logs you need and I can extract the logs for that time period and upload them.
@syang1997
Author

Partial node monitoring metrics @yanliang567

[node monitoring screenshots]

@yanliang567
Contributor

/assign @congqixia
please help to take a look

@yanliang567 added the help wanted (Extra attention is needed) label and removed the kind/bug and triage/needs-information labels on Jul 8, 2024
@syang1997
Author

@yanliang567 Who can help investigate? It recently appeared again.

@yanliang567
Contributor

No clues yet. We need more info from the logs to know what was happening at that moment. Please provide the full Milvus pod logs for that period.

@yanliang567 modified the milestones: 2.4.6, 2.4.7 on Jul 19, 2024
@syang1997
Author

milvus-log (3).tar.gz
@yanliang567 I captured the latest debug logs for a p99 timeout.
Filtering the logs, I found entries in the proxy log showing a long wait for the search result, as follows:

Line 13498: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/task_search.go:436] ["tr/proxy execute search 451345685543387137"] [traceID=02c29e210100a6076c41537ce36985cb] [msg=done] [duration=3.780114072s]
Line 13511: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/impl.go:2634] [tr/Search] [traceID=02c29e210100a6076c41537ce36985cb] [msg="wait search result"] [duration=3.780748241s]

But I did not find a similarly long query time in the QueryNode logs. I initially suspected an MQ problem, but the monitoring shows Pulsar's resource usage was very low and its nodes were not abnormal at the time.
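
For readers trying to do the same correlation, here is a minimal sketch (not part of the original report) for pulling every line with a given traceID out of the unpacked log bundle; the directory layout, file extension, and default traceID are assumptions taken from the lines quoted above:

import glob
import sys

# traceID of the slow search, taken from the proxy log lines above
trace_id = sys.argv[1] if len(sys.argv) > 1 else "02c29e210100a6076c41537ce36985cb"

# assumed layout: the unpacked milvus-log bundle with one plain-text .log file per pod
for path in sorted(glob.glob("milvus-log/**/*.log", recursive=True)):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if trace_id in line:
                print(f"{path}:{lineno}: {line.rstrip()}")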

@syang1997
Author

@yanliang567 Can you help me analyze the cause of the timeout?

@yanliang567
Contributor

okay, let me check the logs

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

Hello @syang1997, could you please provide us with the monitoring for wait tsafe latency?

@syang1997
Author

Hello @syang1997, could you please provide us with the monitoring for wait tsafe latency?

[monitoring screenshots]

@yanliang567
Contributor

Additionally, please attach the metric screenshots around 2024/07/23 15:05 (±1h); it would help us address the issue. @syang1997

@syang1997
Author

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.

@syang1997
Author

Additionally, please attach the metric screenshots around 2024/07/23 15:05 (±1h); it would help us address the issue. @syang1997

Is there any way to export only a specified time period? Running the log export script takes a long time, and a full 24-hour log is too large.
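
As a workaround for the export-size problem, a small sketch like the following could trim an already-collected pod log down to the ±1h window; the input/output file names and the timestamp format are assumptions based on the log lines quoted in this thread:

import re
from datetime import datetime, timedelta

# timestamp format assumed from the Milvus log lines above, e.g. [2024/07/23 15:05:56.323 +00:00]
TS_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

center = datetime(2024, 7, 23, 15, 5)                      # moment of the slow search
lo, hi = center - timedelta(hours=1), center + timedelta(hours=1)

with open("proxy.log", errors="replace") as src, open("proxy-window.log", "w") as dst:
    for line in src:
        m = TS_RE.search(line)
        if m and lo <= datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S") <= hi:
            dst.write(line)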

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.

@syang1997 The wait tsafe latency monitoring looks like this:

[screenshot of the wait tsafe latency panel]

@syang1997
Author

@bigsheeper The wait-for-search latency is long, but the QueryNode latency is not.
@syang1997 The wait tsafe latency monitoring looks like this:

[screenshot of the wait tsafe latency panel]

I didn't find this panel. Which version of the Grafana dashboard are you using?

@bigsheeper
Contributor

bigsheeper commented Jul 24, 2024

I didn't find this panel. Which version of the Grafana dashboard are you using?

That's OK; if the querynode search request latency is low, then the wait tsafe latency is likely to be low as well.

@syang1997
Author

@bigsheeper @yanliang567 What more information do you need from me?

@congqixia
Contributor

@syang1997 quick question: did you detect any network issues while this problem was happening?

@syang1997
Author

quick question: did you detect any network issues while this problem was happening?

Judging from the node network monitoring, there is no problem with the network.

@xiaofan-luan
Contributor

What about your host machine CPU?
Sometimes a burst on another pod can also affect your performance.

@xiaofan-luan
Contributor

You may also check the Go garbage collection and K8s throttling info.
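
For what it's worth, here is a minimal sketch of the CPU-throttling check suggested above, run from inside a pod. The paths and field names come from the cgroup v1/v2 documentation, not from Milvus; this is only an illustration. (Go GC pauses can similarly be surfaced with the runtime's GODEBUG=gctrace=1 setting, though the Milvus dashboards already chart GC metrics.)

from pathlib import Path

# cgroup v2 exposes cpu.stat at the cgroup root; cgroup v1 keeps it under the cpu controller
for candidate in (Path("/sys/fs/cgroup/cpu.stat"), Path("/sys/fs/cgroup/cpu/cpu.stat")):
    if candidate.exists():
        stats = dict(line.split() for line in candidate.read_text().splitlines() if line.strip())
        print(f"{candidate}:")
        print(f"  nr_periods   = {stats.get('nr_periods')}")
        print(f"  nr_throttled = {stats.get('nr_throttled')}")  # > 0 means the CPU limit kicked in
        print(f"  throttled    = {stats.get('throttled_usec', stats.get('throttled_time'))}")
        break
else:
    print("no cpu.stat found; is this running inside a cgroup-managed container?")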

@syang1997
Author

What about your host machine CPU? Sometimes a burst on another pod can also affect your performance.

The node metrics are normal: CPU utilization is very low and network bandwidth usage is not high. I also uploaded the node monitoring information earlier.

@syang1997
Author

You may also check the Go garbage collection and K8s throttling info.

I checked the GC status of the components in the Milvus monitoring and found no problems. As for k8s resources, nothing is overcommitted or throttled.

@syang1997
Author

@yanliang567 My preliminary judgment is that the problem lies with Pulsar, but Pulsar's CPU and memory usage are very low. Could it be caused by Pulsar's PVC being exhausted and cleaned up? I haven't checked Pulsar's PVC usage yet.

@syang1997
Author

Hello @syang1997 , it seems your monitoring dashboard version is quite outdated. Could you please update your dashboard by importing this configuration file? (https://github.com/milvus-io/milvus/blob/master/deployments/monitor/grafana/milvus-dashboard.json) Then, provide us with the updated monitoring information to help us further investigate the issue. Thank you very much.

Okay, I have deployed the latest dashboard. What metric information do you need?

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

Okay, I have deployed the latest dashboard. What metric information do you need?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

@syang1997
Author

Okay, I have deployed the latest dashboard. What metric information do you need?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

Okay, please do so as soon as possible, because the monitoring data is retained for only 7 days.

sre-ci-robot pushed a commit that referenced this issue Jul 25, 2024
/kind improvement
cc @yanliang567 
issue: #34476

Signed-off-by: Edward Zeng <jie.zeng@zilliz.com>
@syang1997
Author

@bigsheeper I upgraded Grafana. What monitoring metrics do I need to provide?

@bigsheeper
Contributor

@bigsheeper I upgraded Grafana. What monitoring metrics do I need to provide?

@syang1997 It would be best to have all the dashboards for the proxy and querynode.

@syang1997
Author

[proxy and querynode dashboard screenshots]

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

@syang1997 The monitoring looks fine. I still think it might be a network issue between the proxy and the querynode; I mean stability, not throughput.

@bigsheeper
Contributor

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:408] ["call received grpc error"] [clientRole=querynode-53] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [retry/retry.go:102] ["retry func failed"] [retried=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:554] ["ClientBase Call grpc call get error"] [role=querynode-53] [address=10.37.113.183:21123] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n/usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102\n  | github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n  | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 
github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | /go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

Please check if there are frequent connection timeout logs with querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997
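
A rough way to run that check, sketched here for illustration (the file name and the role/clientRole field layout are assumptions taken from the warning snippets above), is to tally DeadlineExceeded warnings per querynode in the proxy log:

import re
from collections import Counter

# field layout assumed from the proxy warnings above, e.g. [clientRole=querynode-53] or [role=querynode-53]
ROLE_RE = re.compile(r"\[(?:role|clientRole)=(querynode-\d+)\]")
timeouts = Counter()

with open("proxy.log", errors="replace") as f:
    for line in f:
        if "DeadlineExceeded" in line:
            m = ROLE_RE.search(line)
            if m:
                timeouts[m.group(1)] += 1

for node, count in timeouts.most_common():
    print(f"{node}: {count} timeout warnings")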

@syang1997
Author

[architecture diagram screenshots]

I think in the 2.4 architecture the Proxy talks to the QueryNode directly, while in 2.3 it goes through the MQ. My Milvus version is 2.3.15. @bigsheeper

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

@syang1997
Author

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

@syang1997
Author

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

[warning logs quoted above]

Please check if there are frequent connection timeout logs with querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

@bigsheeper
Contributor

bigsheeper commented Jul 25, 2024

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in milvus, search requests are sent directly from proxy to querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997
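
To watch the proxy-to-querynode network path itself, a crude probe like the one below can be left running from the proxy host to see whether connect times occasionally spike. This is a sketch only: the address is the one from the warning logs, and it measures raw TCP connect latency to the gRPC port, not Milvus behaviour.

import socket
import statistics
import time

HOST, PORT = "10.37.113.183", 21123   # querynode-53 address from the warning logs above

samples = []
for _ in range(20):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=3):
            samples.append((time.monotonic() - start) * 1000)
    except OSError as exc:
        print(f"connect failed: {exc}")
    time.sleep(0.5)

if samples:
    print(f"connects={len(samples)}  min={min(samples):.1f} ms  "
          f"p50={statistics.median(samples):.1f} ms  max={max(samples):.1f} ms")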

@syang1997
Author

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3 my guess was that the problem lay in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in milvus, search requests are sent directly from proxy to querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997

Okay, I will continue investigating in this direction.

@syang1997
Author

@bigsheeper The query link in the 2.3 architecture diagram is easy to misunderstand. Could you consider clarifying it?
