Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01
#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
     Reporter:  anarcat                          |      Owner:  anarcat
         Type:  defect                           |     Status:  needs_review
     Priority:  Medium                           |  Milestone:
    Component:  Internal Services/Tor            |    Version:
                Sysadmin Team                    |
     Severity:  Blocker                          |  Resolution:
     Keywords:                                   |  Actual Points:
    Parent ID:                                   |        Points:
     Reviewer:                                   |      Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):
* status: assigned => needs_review
Comment:
As I can't figure out the network issue, I'm trying another tack: I've
raised the scrape_interval from 15s to 5m while extending the retention
period from 30d to 365d. The latter shouldn't take effect for 30 days,
while the former will have finished converting the database within 30
days. If, after 30 days, we still have this problem, we'll know it is
not caused by the aggressive scrape interval, and we might want to
consider setting up a secondary server (#31244) to see if it can
reproduce the problem.
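For reference, this is roughly what those two knobs look like at the
Prometheus level once Puppet has rendered the config; a sketch based on
stock Prometheus 2.x, with the values from this ticket:
{{{
# prometheus.yml -- scrape cadence is a global setting
global:
  scrape_interval: 5m   # was 15s; how often every target is scraped

# retention is not set in prometheus.yml; on 2.x it is a startup flag:
#   prometheus --storage.tsdb.retention.time=365d   # was 30d
}}}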
Or, as the commit log says:
{{{
origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
Author: Antoine Beaupré <anarcat@xxxxxxxxxx>
AuthorDate: Tue Oct 22 13:46:05 2019 -0400
Commit: Antoine Beaupré <anarcat@xxxxxxxxxx>
CommitDate: Tue Oct 22 13:46:05 2019 -0400
Parent: 91e379a5 make all mpm_worker parameters configurable
Merged: master sudo-ldap
Contained: master
downgrade scrape interval on internal prometheus server (#31916)
This is an attempt at fixing the reliability issues on the prometheus
server detailed in #31916. The current theory is that ipsec might be
the culprit, but it's also possible that the Prometheus server is
overloaded and that's creating all sorts of other, unrelated problems.
This is sidetracking the setup of a *separate* long term monitoring
server (#31244), of course, but I'm not sure that's really necessary
for now. Since we don't use prometheus for alerting (#29864), we don't
absolutely /need/ redundancy here so we can afford a SPOF for
Prometheus while we figure out this bug.
If, in thirty days, we still have reliability problems, we will know
this is not due to the retention period and can cycle back to the
other solutions, including creating a secondary server to see if it
reproduces the problem.
1 file changed, 2 insertions(+), 1 deletion(-)
modules/profile/manifests/prometheus/server/internal.pp | 3 ++-
modified modules/profile/manifests/prometheus/server/internal.pp
@@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
vhost_name => $vhost_name,
collect_scrape_jobs => $collect_scrape_jobs,
scrape_configs => $scrape_configs,
- storage_retention => '30d',
+ storage_retention => '365d',
+ scrape_interval => '5m',
}
# expose our IP address to exporters so they can allow us in
#
}}}
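For completeness, here's one way to check that the new settings are
live after Puppet runs; a sketch assuming shell access to the
Prometheus host and the stock 2.x HTTP API on the default port 9090
(neither is specified in this ticket):
{{{
# dump the config as the running server parsed it, look for the interval
curl -s http://localhost:9090/api/v1/status/config | grep scrape_interval

# spot-check the cadence: raw `up` samples over the last 15 minutes
# should show about three points per target at 5m (vs ~60 at 15s)
curl -s 'http://localhost:9090/api/v1/query?query=up[15m]'
}}}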
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online