Re: [tor-bugs] #31244 [Internal Services/Tor Sysadmin Team]: long term prometheus metrics
#31244: long term prometheus metrics
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: enhancement | Status:
| assigned
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):
* owner: tpa => anarcat
* status: new => assigned
Comment:
i've decided to postpone the creation of a secondary server and instead
change the retention period on the current server to see if that fixes
the reliability issues detailed in #31916. if, in 30 days, we still have
this problem, then we can set up a secondary to see if we can reproduce
the problem there. after all, we don't need a redundant setup as long as
we don't do alerting, for which we still use nagios (#29864). see also
the commit log for more details:
{{{
origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
Author: Antoine Beaupré <anarcat@xxxxxxxxxx>
AuthorDate: Tue Oct 22 13:46:05 2019 -0400
Commit: Antoine Beaupré <anarcat@xxxxxxxxxx>
CommitDate: Tue Oct 22 13:46:05 2019 -0400
Parent: 91e379a5 make all mpm_worker paramaters configurable
Merged: master sudo-ldap
Contained: master
downgrade scrape interval on internal prometheus server (#31916)
This is an attempt at fixing the reliability issues on the prometheus
server detailed in #31916. The current theory is that ipsec might be
the culprit, but it's also possible that prometheus is overloaded
and that's creating all sorts of other, unrelated problems.
This is sidetracking the setup of a *separate* long term monitoring
server (#31244), of course, but I'm not sure that's really necessary
for now. Since we don't use prometheus for alerting (#29864), we don't
absolutely /need/ redundancy here so we can afford a SPOF for
Prometheus while we figure out this bug.
If, in thirty days, we still have reliability problems, we will know
this is not due to the retention period and can cycle back to the
other solutions, including creating a secondary server to see if it
reproduces the problem.
1 file changed, 2 insertions(+), 1 deletion(-)
modules/profile/manifests/prometheus/server/internal.pp | 3 ++-
modified modules/profile/manifests/prometheus/server/internal.pp
@@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
vhost_name => $vhost_name,
collect_scrape_jobs => $collect_scrape_jobs,
scrape_configs => $scrape_configs,
- storage_retention => '30d',
+ storage_retention => '365d',
+ scrape_interval => '5m',
}
# expose our IP address to exporters so they can allow us in
#
}}}
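as a sanity check on the storage impact of that diff, here is a rough
back-of-the-envelope estimate (not from the ticket; it assumes the server
previously scraped at 15s, a common prometheus setting, which is not stated
above). it shows that 365d retention at a 5m scrape interval actually keeps
*fewer* samples per series than 30d at 15s would:

```python
# Rough estimate of samples stored per time series, given a scrape
# interval and a retention period. The 15s baseline is an assumption,
# not a value taken from the ticket.

SECONDS_PER_DAY = 86400

def samples_per_series(retention_days: int, scrape_interval_s: int) -> int:
    """Number of samples Prometheus keeps for one series, ignoring
    compression and staleness markers."""
    return retention_days * SECONDS_PER_DAY // scrape_interval_s

before = samples_per_series(30, 15)    # old: 30d retention, 15s scrapes
after = samples_per_series(365, 300)   # new: 365d retention, 5m scrapes

print(before)  # 172800 samples per series
print(after)   # 105120 samples per series
```

so, under that assumed baseline, the change trades resolution for history
without growing the on-disk sample count, which is consistent with the goal
of relieving a possibly overloaded server.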
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31244#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs