#soylent | Logs for 2025-10-02

« return
[00:17:32] <c0lo> Yeah, excess patriotism, parades and flags rarely help
[01:00:14] -!- bender has quit [Remote host closed the connection]
[01:00:19] -!- bender [bender!bot@Soylent/BotArmy] has joined #soylent
[01:00:20] -!- bender has quit [Changing host]
[01:00:20] -!- bender [bender!bot@Soylent/Bot/Bender] has joined #soylent
[01:00:55] -!- devbot has quit [Remote host closed the connection]
[01:00:56] -!- devbot [devbot!devbot@Soylent/BotArmy] has joined #soylent
[01:01:55] -!- Loggie [Loggie!Loggie@Soylent/BotArmy] has joined #soylent
[02:07:27] <schestowitz[TR]> <a
[02:12:28] <schestowitz[TR]> Tbe
[02:14:59] <schestowitz[TR]> So
[02:22:47] <ted-ious> If you say so.
[02:48:53] <schestowitz[TR]> oops, sorry, wrong window
[02:50:21] -!- progo [progo!~progo@eegc-73-589-96-43.nwrknj.fios.verizon.net] has joined #soylent
[03:38:58] -!- systemd [systemd!~systemd@pid1] has joined #soylent
[07:15:23] -!- Runaway1956 [Runaway1956!~OldGuy@the.abyss.stares.back] has joined #soylent
[07:16:13] -!- Runaway has quit [Ping timeout: 245 seconds]
[08:29:08] -!- schestowitz[TR] has quit [Ping timeout: 245 seconds]
[08:29:30] -!- schestowitz[TR] [schestowitz[TR]!~schestowi@2a00:23c8:7480:ntlo:inlr:kiy:mnxm:utyh] has joined #soylent
[08:47:08] -!- c0lo [c0lo!~c0lo@124.190.mg.jlq] has joined #soylent
[08:50:33] <Ingar> 200 OK
[09:08:18] -!- schestowitz[TR] has quit [Ping timeout: 245 seconds]
[09:28:54] -!- schestowitz[TR] [schestowitz[TR]!~schestowi@swuj40-76-144-95.range88-95.btcentralplus.com] has joined #soylent
[09:42:02] <Ingar> 500 Internal Server error
[09:42:53] <janrinok> thx
[09:47:16] <janrinok> Unfortunately, nothing is showing why it happened. It is not overloaded, it has sufficient memory, etc.
[09:49:25] <fab23> but a web server should also log error messages with more detail, or else the application server behind it should
[09:51:16] <janrinok> yes, but I cannot get to the logs very easily.
[09:51:53] <janrinok> I've left some info for kolie on another channel
[09:54:33] <janrinok> we are currently seeing a significant increase in Alert emails but there appears to be no obvious pattern as to which alerts are triggering or why.
[09:55:23] <janrinok> I used to see a handful each day but currently I am seeing 45-50 alert emails per day.
[10:16:02] <fab23> hunting for error causes is usually a rabbit hole, so it may take significant time
[10:44:58] -!- c0lo has quit [Ping timeout: 245 seconds]
[12:04:20] -!- nosler [nosler!~nosler@207.148.hyp.qg] has joined #soylent
[12:06:04] -!- nosler has quit [Client Quit]
[13:58:43] -!- progo has quit [Ping timeout: 245 seconds]
[13:59:10] -!- progo [progo!~progo@eegc-73-589-96-43.nwrknj.fios.verizon.net] has joined #soylent
[14:05:23] -!- progo has quit [Ping timeout: 245 seconds]
[14:14:08] -!- schestowitz[TR] has quit [Ping timeout: 245 seconds]
[14:24:36] -!- schestowitz[TR] [schestowitz[TR]!~schestowi@2a00:23c8:7480:pxtk:jttn:uhs:snyx:nnxj] has joined #soylent
[14:58:09] -!- progo [progo!~progo@eegc-73-589-96-43.nwrknj.fios.verizon.net] has joined #soylent
[15:07:28] -!- schestowitz[TR] has quit [Ping timeout: 245 seconds]
[15:09:58] -!- schestowitz[TR] [schestowitz[TR]!~schestowi@swuj40-76-144-95.range88-95.btcentralplus.com] has joined #soylent
[18:46:30] <prg> I'm getting a couple of 503s now.
[18:49:34] -!- bender has quit [Remote host closed the connection]
[18:49:37] -!- bender [bender!bot@Soylent/BotArmy] has joined #soylent
[18:49:38] -!- bender has quit [Changing host]
[18:49:38] -!- bender [bender!bot@Soylent/Bot/Bender] has joined #soylent
[18:53:04] <prg> Happened three times within a couple of minutes. Now it seems to be fine again.
[18:53:07] <janrinok> Yeah, we are working on that problem at the moment. We see it too. Thanks for the confirmation though. The site goes very sluggish just before the 5xx appear
[18:53:19] <kolie> I uhh
[18:53:24] <kolie> reset the database which caused the errors.
[18:53:32] <kolie> I did some other related work.
[18:53:42] <janrinok> I say "we" are working on it - kolie is working on it and I am asking dumb questions...
[18:53:45] <kolie> And uhh, well I think we are in normal performance range now.
[18:54:55] <ted-ious> Is it too hard to set up a script that runs from another network, just tries the homepage every minute, and compares those results to the httpd logs?
[18:55:03] <prg> Yeah it seems to be pretty fast again. Thanks for your work.
[18:55:21] <ted-ious> A free shell account should be good enough as long as it is running ntp and has accurate time stamps.
[18:55:36] <kolie> What information would that be intended to provide ted-ious ?
[18:55:49] <janrinok> ted-ious, I have such a thing watching new accounts - but it does not yet record why it is having problems with the connection. I hope to rewrite part of it this weekend.
[18:56:14] <ted-ious> I assume that the soylent logs also have accurate time stamps so it should be simple to correlate them.
[18:56:27] <kolie> hahahaha
[18:56:33] <kolie> I wish.
[18:56:48] <kolie> The apache logs, I mean, I can see the "issue"
[18:56:57] <kolie> Figuring out why at that time, good luck.
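(A minimal sketch of the external check ted-ious suggests, assuming Python 3 on a remote shell account; the target URL, log file name and interval are illustrative. It fetches the homepage once a minute and records a UTC timestamp plus the HTTP status, so entries can be matched against the server-side httpd logs.)

```python
#!/usr/bin/env python3
# External uptime probe: fetch the homepage every minute and log a UTC
# timestamp plus the HTTP status so entries can be correlated with the
# server-side httpd logs. URL, log path and interval are assumptions.
import time
import urllib.request
import urllib.error
from datetime import datetime, timezone

URL = "https://soylentnews.org/"   # assumed target
LOGFILE = "homepage-probe.log"     # assumed local log file
INTERVAL = 60                      # seconds between checks

def check():
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            return resp.status, None
    except urllib.error.HTTPError as e:
        return e.code, None            # server answered, e.g. 500/503
    except Exception as e:
        return None, repr(e)           # timeout, DNS failure, reset, ...

while True:
    started = time.monotonic()
    status, err = check()
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(LOGFILE, "a") as f:
        f.write(f"{stamp} status={status} error={err}\n")
    time.sleep(max(0, INTERVAL - (time.monotonic() - started)))
```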
[18:57:20] <ted-ious> Are you getting your 503s from your web browser or something else, prg?
[18:57:21] <kolie> rn, there are numerous indications and "issues" around the stack, and I've got a pretty good picture of what's going on.
[18:57:34] <kolie> prg's issue was almost certainly me rebooting the database.
[18:57:55] <kolie> I'm doing that in response to the clustered 500s we identified about 3 hours ago.
[18:58:14] <kolie> About 2.5 hours ago we stopped the clustered 503s, but there was still the issue of why the system was loading up
[18:58:24] <kolie> 503 was caused by db connections being exhausted.
[18:58:29] <kolie> I just set connections higher.
[18:58:37] <kolie> Now we are using too many resources, which causes other issues.
[18:59:06] <chromas> How many db connections do you need?
[18:59:14] <janrinok> both of them....
[18:59:16] <kolie> ~200
[18:59:17] <chromas> Shouldn't it just be one per http worker?
[18:59:38] <kolie> Well, why we have gotten 500s in the last three months
[18:59:46] <kolie> is that rehash goes to connect to the db
[18:59:51] <kolie> and gets a max connections warning
[18:59:54] <kolie> because 150 isn't enough.
[19:00:09] <kolie> Why? degraded performance elsewhere.
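(A rough sketch of how the connection headroom kolie describes could be checked, assuming a MySQL/MariaDB backend behind rehash and the third-party pymysql driver; host and credentials are placeholders.)

```python
# Compare connections in use (and the peak since restart) against
# max_connections; a peak that keeps brushing the limit explains the
# "max connections" warnings rehash was getting at 150.
import pymysql  # assumed third-party driver

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    max_conn = int(cur.fetchone()[1])
    cur.execute("SHOW STATUS LIKE 'Threads_connected'")
    in_use = int(cur.fetchone()[1])
    cur.execute("SHOW GLOBAL STATUS LIKE 'Max_used_connections'")
    peak = int(cur.fetchone()[1])
conn.close()

print(f"in use: {in_use} / {max_conn} (peak since restart: {peak})")
# Raising the limit (e.g. SET GLOBAL max_connections = 200;) buys headroom,
# but connections pile up in the first place because queries elsewhere
# are running slowly and holding them open.
```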
[19:00:28] <janrinok> my software is dropping offline - lost connection
[19:01:00] <kolie> op - 19:00:54 up 2 days, 18:21, 4 users, load average: 267.33, 176.03, 121.75
[19:01:21] <janrinok> site down for me
[19:01:45] <janrinok> 503 - Backend Fetch failed
[19:02:47] <kolie> the 270 load might have something to do with that.
[19:02:52] <prg> ted-ious: that was from my browser
[19:03:17] <prg> huh, yeah, that doesn't look healthy.
[19:04:39] <ted-ious> Oh ok.
[19:04:54] <ted-ious> I thought maybe you had your own script checking.
[19:05:03] <kolie> this looks like uhh
[19:05:06] <kolie> the backup.
[19:05:19] <prg> no, I was just reading the site.
[19:05:24] <janrinok> yep, bang on schedule
[19:06:02] <kolie> and it appears that the database is locked, but not by the backup, which is itself waiting on something else too.
[19:06:45] <janrinok> the site is responding again but sluggish
[19:07:05] <kolie> well, got some places to look at least.
[19:07:28] <janrinok> yeah, but it sounds like those places are Rehash and everywhere else :)
[19:07:50] <kolie> more information than I've had before, so?
[19:07:55] <janrinok> lol
[19:09:57] <janrinok> ah, it is the witching hour for me. I'll be monitoring the site for an hour or two more but won't have IRC available.
[19:10:16] <janrinok> See you all tomorrow - and good luck kolie!
[19:11:12] <kolie> thx.
[19:17:48] <kolie> ok well memory pressure is down.
[19:17:59] <kolie> I think we need to allocate some more cores to the infra.
[19:18:13] <kolie> The load numbers are just too high.
[19:18:26] <kolie> I got them down to 28 from 250
[19:18:41] <kolie> top - 19:18:35 up 2 days, 18:39, 4 users, load average: 18.55, 64.41, 100.99
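(For context on those numbers: load average approximates the count of tasks running or queued, so values sustained far above the machine's core count mean requests are waiting rather than executing. A small sketch, assuming a Unix host.)

```python
# Flag sustained overload by comparing load averages to the core count.
import os

load1, load5, load15 = os.getloadavg()
cores = os.cpu_count()
print(f"load: {load1:.2f} {load5:.2f} {load15:.2f} on {cores} cores")
if load5 > 2 * cores:
    print("sustained overload: work is queueing faster than it can run")
```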
[19:19:29] <kolie> the apache/perl architecture is uhh, well, fun :)
[21:12:02] -!- esainane [esainane!esainane@sendjocq.space] has joined #soylent