[Bug]: Very high p99 with low requests and sufficient resources #34476
Comments
quick questions:
/assign @congqixia @syang1997
Some node monitoring metrics @yanliang567
/assign @congqixia
@yanliang567 Who can help with the investigation? It appeared again recently.
We do not have any clues yet. We need more info from the logs to know what was happening at that moment. Please provide the full Milvus pod logs for that period.
milvus-log (3).tar.gz
But I did not find any correspondingly slow queries in the QueryNode log. I initially suspected an MQ problem, but monitoring shows that Pulsar's resource usage was very low, and the nodes were not abnormal at the time.
@yanliang567 Can you help me analyze the cause of the timeout?
Okay, let me check the logs.
Hello @syang1997, could you please provide us with the monitoring for wait tsafe latency?
Additionally, please attach the metric screenshots around 2024/07/23 15:05 (+1h, -1h); it would be helpful for us to address the issue. @syang1997
@bigsheeper The latency waiting for search results is long, but the QueryNode latency is not.
Is there any way to export logs for a specified time period? It has been a long time since the issue occurred, and using the log script to collect the full 24-hour log produces too much data.
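Not sure if it helps, but here is a minimal sketch of how a narrower time window could be exported with kubectl; the pod names, namespace, and timestamp are placeholders, not values from this deployment:

```python
import subprocess

PODS = ["milvus-proxy-0", "milvus-querynode-0"]    # placeholder pod names
NAMESPACE = "milvus"                               # placeholder namespace
SINCE = "2024-07-23T14:00:00+08:00"                # RFC3339, ~1h before the spike

for pod in PODS:
    # kubectl logs --since-time returns only entries newer than SINCE,
    # which keeps the export far smaller than a full 24h dump
    out = subprocess.run(
        ["kubectl", "logs", pod, "-n", NAMESPACE, f"--since-time={SINCE}"],
        capture_output=True, text=True, check=True,
    )
    with open(f"{pod}.log", "w") as f:
        f.write(out.stdout)
```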
@syang1997 The wait tsafe latency monitoring looks like this:
I didn't find this panel. Which version of the Grafana dashboard did you use?
It's ok; if the querynode search request latency is low, then the wait tsafe latency is likely to be low as well.
@bigsheeper @yanliang567 What more information do you need from me?
@syang1997 quick question, have you detected any network issues while this problem was happening?
Judging from the node network monitoring, there is no problem with the network.
What about your host machine CPU?
You may also check the Go garbage collection and K8s throttling info.
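For reference, a rough sketch of how the GC and K8s throttling checks could be pulled from Prometheus. The Prometheus address and the container label values are assumptions; `go_gc_duration_seconds` and `container_cpu_cfs_throttled_seconds_total` are the standard Go-runtime and cAdvisor metric names:

```python
import requests

PROM = "http://prometheus:9090"   # placeholder Prometheus address

queries = {
    # worst-case GC pause per pod (Go runtime summary metric)
    "gc_pause_max": 'go_gc_duration_seconds{quantile="1"}',
    # CPU time spent throttled by the CFS quota (cAdvisor metric); container names assumed
    "cpu_throttled": 'rate(container_cpu_cfs_throttled_seconds_total{container=~"proxy|querynode"}[5m])',
}

for name, q in queries.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=10)
    for r in resp.json()["data"]["result"]:
        print(name, r["metric"].get("pod", "?"), r["value"][1])
```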
The nodes' metrics are normal: CPU utilization is very low and network bandwidth usage is not high. I also uploaded the node monitoring information earlier.
I checked the GC status of the Milvus components in monitoring and found no problems. As for K8s resources, there is no overcommitment or throttling.
@yanliang567 My preliminary judgment is that the problem lies with Pulsar, but Pulsar's CPU and memory usage are very low. Could it be caused by Pulsar's PVC being exhausted and cleaned up? I haven't checked Pulsar's PVC usage yet.
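A small sketch of one way to check Pulsar PVC usage from kubelet volume stats; the PVC name pattern and the Prometheus address are assumptions:

```python
import requests

PROM = "http://prometheus:9090"   # placeholder Prometheus address
# fraction of each Pulsar PVC already used, from kubelet volume stats
QUERY = (
    'kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*pulsar.*"}'
    ' / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*pulsar.*"}'
)

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
for r in resp.json()["data"]["result"]:
    pvc = r["metric"]["persistentvolumeclaim"]
    print(f"{pvc}: {float(r['value'][1]):.1%} used")
```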
Okay, I have deployed the latest panel. What metrics do I need to provide?
@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged. |
Okay, please do so as soon as possible, because the monitoring data is only retained for 7 days.
/kind improvement cc @yanliang567 issue: #34476 Signed-off-by: Edward Zeng <jie.zeng@zilliz.com>
@bigsheeper I upgraded Grafana. What monitoring metrics do I need to provide?
@syang1997 It would be best to have all the dashboards for the proxy and querynode. |
@syang1997 The monitoring looks fine; I still think it might be a network issue between the proxy and the querynode. I meant network stability, not throughput.
I found some warning logs that back up my point. These logs appeared during a slow search and show that the proxy timed out connecting to the querynode.
Please check if there are frequent connection timeout logs with |
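Since the exact log keyword is cut off in the comment above, here is only a hypothetical sketch for counting timeout-like warnings in an exported proxy log; the file name and the patterns are assumptions, not exact Milvus log strings:

```python
import re
from collections import Counter

LOG_FILE = "milvus-proxy-0.log"        # placeholder exported log file
PATTERNS = [                           # assumed timeout-like patterns
    r"context deadline exceeded",
    r"connection.*timeout",
    r"transport.*(error|closing)",
]

hits = Counter()
with open(LOG_FILE) as f:
    for line in f:
        for p in PATTERNS:
            if re.search(p, line, re.IGNORECASE):
                hits[p] += 1

for p, n in hits.most_common():
    print(f"{n:6d}  {p}")
```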
I think in the 2.4 architecture the proxy talks to the QueryNode directly, while in 2.3 it goes through the MQ. My Milvus version is 2.3.15. @bigsheeper
@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself. |
Yes, according to the monitoring, the timeout occurs in the proxy-to-querynode path rather than in the Search operation itself. So my question for 2.3 is whether the problem lies in the MQ transmission path.
@bigsheeper So do the logs indicate that the request is sent directly from the proxy to the QueryNode?
From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ in this issue.
Yep, in Milvus, search requests are sent directly from the proxy to the querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997
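As one way to test that hypothesis, a minimal probe sketch that times repeated TCP connects from the proxy pod to a querynode and reports the failure count and tail latency; the querynode address and port are placeholders:

```python
import socket
import statistics
import time

HOST, PORT = "milvus-querynode-0.milvus.svc", 21123   # placeholder address/port

samples = []
for _ in range(200):
    start = time.perf_counter()
    try:
        # time only the TCP handshake; instability shows up as failures or a fat tail
        with socket.create_connection((HOST, PORT), timeout=2):
            pass
        samples.append((time.perf_counter() - start) * 1000)
    except OSError:
        samples.append(float("inf"))
    time.sleep(0.1)

ok = [s for s in samples if s != float("inf")]
print("failed connects:", len(samples) - len(ok))
print("p99 connect ms:", statistics.quantiles(ok, n=100)[98])
```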
Okay, I will continue to investigate in this direction.
@bigsheeper The "query" link in the 2.3 architecture diagram is easy to misunderstand. Could you consider clarifying it?
Is there an existing issue for this?
Environment
Current Behavior
While requests were steady, p99 latency suddenly spiked to around 15k ms,
even though resources were sufficient and CPU and memory usage were low. What was the reason?
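For context, a minimal sketch (not the reporter's actual workload) of how such a client-side p99 measurement could be reproduced with pymilvus; the collection name, vector field, dimension, and search params are assumptions:

```python
import random
import statistics
import time

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")
coll = Collection("example_collection")      # placeholder collection name
coll.load()

latencies = []
for _ in range(1000):
    vec = [random.random() for _ in range(128)]          # assumed dim=128
    t0 = time.perf_counter()
    coll.search(
        data=[vec],
        anns_field="embedding",                          # placeholder vector field
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
    )
    latencies.append((time.perf_counter() - t0) * 1000)

print("client-side p99 ms:", statistics.quantiles(latencies, n=100)[98])
```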
The following is the monitoring
The following is the querynode log
Expected Behavior
No response
Steps To Reproduce
Milvus Log
No response
Anything else?
No response