How to pinpoint CPU spike?

I have a v13 instance running in production mode on AWS.
Occasionally (about once every two to three weeks) the CPU and network load spike; see the attached image from AWS CloudWatch. As a result, the UI gets very slow…

I looked through frappe-bench/logs but could not find anything that explains it…

Any tips or hints on how I could figure out what causes this?

Does it happen at the same time of day every time? If you are able to SSH into the instance while it's happening, just open htop and check what’s running.

Additionally I would check:

  • cron jobs: crontab -e and sudo crontab -e
  • OS logs: sudo journalctl -xe

Does this occur at regular intervals, or is it random? Do you have custom background jobs on your instance? If the spike recurs regularly, take a look at the background jobs.

It does not happen every day, and not at regular intervals either. At least I can’t spot a pattern.

I will probably configure an alarm on AWS to get notified if the CPU exceeds some value… maybe 30%.
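A rough sketch of such an alarm, assuming boto3 is installed with working AWS credentials and that an SNS topic for notifications already exists. The instance ID, topic ARN, alarm name, and helper function names are placeholders; the 5-minute period is chosen because that is the granularity of basic EC2 monitoring.

```python
def build_cpu_alarm(instance_id, sns_topic_arn, threshold=30.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"cpu-above-{int(threshold)}-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds; one datapoint per 5 minutes
        "EvaluationPeriods": 1,   # fire on the first datapoint over threshold
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_cpu_alarm(instance_id, sns_topic_arn, threshold=30.0):
    """Create (or overwrite) the alarm. Requires the third-party boto3 package."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **build_cpu_alarm(instance_id, sns_topic_arn, threshold)
    )
```

With real values substituted, `create_cpu_alarm("i-...", "arn:aws:sns:...")` would send a notification to the topic's subscribers when the 5-minute average CPU goes above 30%.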

I also found that fail2ban was quite busy, but I would not expect it to cause such a spike?

@VamYip thanks for the hints, I will check cron and logs!

Nothing in the cron logs, nothing in the OS logs.

But I did manage to catch one of the spikes in htop.
It seems like gunicorn is taking lots of CPU?

My supervisor.conf:

[program:frappe-bench-frappe-web]
command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 5 -t 120 --threads 6 --keep-alive 4 frappe.app:application --preload --worker-tmp-dir /dev/shm
priority=4
autostart=true
autorestart=true
stdout_logfile=/home/frappe/frappe-bench/logs/web.log
stderr_logfile=/home/frappe/frappe-bench/logs/web.error.log
user=frappe
directory=/home/frappe/frappe-bench/sites

Any ideas?

Edit: running on an AWS t3.medium, 2 vCPUs and 4 GB of RAM. Do I need to take something bigger?
There are never more than 5-10 users online concurrently… Are the gunicorn parameters wrong?
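As a quick sanity check on those parameters: the Gunicorn documentation suggests (2 × cores) + 1 workers as a starting point, and each worker can serve `--threads` requests at once. The sketch below only does that arithmetic on the command line from the supervisor config; `gunicorn_capacity` is a hypothetical helper, and it says nothing about what the workers are actually busy with.

```python
import shlex


def gunicorn_capacity(command, cpu_count):
    """Summarise worker/thread settings of a gunicorn command line."""
    args = shlex.split(command)

    def option(flag, default):
        # Value following the flag, or gunicorn's default if the flag is absent.
        return int(args[args.index(flag) + 1]) if flag in args else default

    workers = option("-w", 1)
    threads = option("--threads", 1)
    return {
        "workers": workers,
        "threads": threads,
        "request_slots": workers * threads,      # requests handled concurrently
        "suggested_workers": 2 * cpu_count + 1,  # rule of thumb from the docs
    }


command = (
    "/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 5 -t 120 "
    "--threads 6 --keep-alive 4 frappe.app:application --preload "
    "--worker-tmp-dir /dev/shm"
)
summary = gunicorn_capacity(command, cpu_count=2)
```

For this config the result is 5 workers (the same as the rule of thumb for 2 vCPUs) and 30 concurrent request slots, so the worker count by itself does not look unusual for 5-10 users. Separately, since T3 instances are burstable, it may be worth checking the CPUCreditBalance metric in CloudWatch around the time of a spike.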

I think there is a problem in version 13; I am having the same issues. I ran v11 for 3 years with over 15 customers, each with a separate database, on a t2.medium and this never happened. When I upgraded to v13, it started running very slowly.