I have been involved into a case where an important application has been behaving erratically in regards of execution time. This is the background.
The application is customised installation of a very-well-known analtics suite in use at a very large shipping and logstics Dutch company. This has been deployed in AWS with the back storage provided by a rather large Postgres RDS both in acceptance and in production environment, with the former being a slimmed down installation when compared to the latter. The UI is web-based.
The company I work with is in charge for the cloud infra, and that alone.
The particular issue happens when browsing a list of postal addresses. This very function is used to browse and search among about 9 million postal addresses all around the country (the Netherlands for the records). To my personal opinion and experience this is not a really darge data set. At the very start, this function displays the first 100 addresses with the usual search boxes, paging widgets and a number that represents the overall counter. You can click on an address and get an expanded view. For some unclear reasons the problem has been moved to the company responsible for the infrastructure.
In acceptance the function seems to work quite well 95% of the times: the first page appears almost instantaneously but 5 times out of 100 it can take up to two minutes (yes, 120 seconds!) before displaying any data. In production, counterintuitively, it’s the other way around: the startup takes a very log time to display the first page of data. Subsequent pages appear almost instantaneously in either environment.
The whole investigation was time-boxed to just 1 working day.
First ask for the logs!
As usual I tried to gather as much details as possible about the application itself. It’s a name I have heard and seens in almost all my previous working experiences from the past 10 years, but this is my very first time to look at it this deep inside.
According to the customer the specific function is written in Java (I think likely almost all of the application is Java) and is hosted under the Apache suite. There is a load balancer in front of a multi-availability zone setup. Thus it’s far from an unusual setup.
Despite asking over and over, I had no chance to get to the application sourcecode itself. So I focused on the one and only part I can freely access: the DB.
I got immediately to the configuration parameters to log query execution time. The aim is double: know how the application interacts with the database and know how long it takes to execute those queries. The friendly manual told me how to. AWS RDS requires you to setup a parameter group, use it to configure the DB and to reboot the instanc to make it in use. So I did my setup:
log_min_duration_statement = 0 # (ms) Sets the minimum execution time above which statements will be logged.
log_statement = all # Sets the type of statements logged.
log_duration = 1 # (boolean) Logs the duration of each completed SQL statement.
Basically I wanted to know which queries were run, all of them, and how long they took from the DB perspective no matter the actual duration.
Of course, a few interesting things popped up. Let’s have a look.
First I had to figure out how the tables are named. I could have run my faithful PgAdmin tool but a few searches with vim editor over the log files was effective enough.
The main table used for those addresses is called
ebx_pdl_address and to get the number of its rows the actual query is:
SELECT n_live_tup FROM pg_stat_all_tables WHERE relname = lower('ebx_pdl_address')
count(*). Of course. Instead of counting the lines in a table the application is asking the DB for an estimate of the “number of live rows” in that table. So we will get whatever the DB staistics keeper thinks it’s the number of rows. This number can be different from the actual number. This is done because in Postgres the
count aggregate function “requires effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index that includes all rows in the table“. The programmer has decided that and estimate is more than enough. I cannot agree more on this decision as the user won’t bother to be off even by a few percentages over 9+ million rows.
Soon after I found the query I was looking for.
SELECT t0.T_LAST_USER_ID, t0.T_CREATOR_ID, t0.T_CREATION_DATE, t0.T_LAST_WRITE, t0.general_addressid, t0.general_housenumber, t0.general_housenumberaddition, t0.general_postalcodearea_, t0.general_street_, t0.general_status, t0.general_addresstype, t0.general_addressableobjecttype, t0.general_parcelnumber, t0.general_parcelcode__addresstype, t0.general_parcelcode__parcelcode, t0.general_controlindicator, t0.general_letterboxnumber, t0.general_deliverypointnumber, t0.general_deliverynumber, t0.general_qualityletterbox, t0.general_additionalrouteflat, t0.general_additionalroutestaircase, t0.general_additionalrouteelevator, t0.general_additionalroutetype, t0.general_startdate, t0.general_enddate, t0.general_lastmutationdate, t0.general_versionstartdate, t0.general_cityidbag, t0.general_municipalityidbag, t0.general_publicspacebag, t0.general_pandidbag, t0.general_statuspandbag, t0.general_verblijfsobjectidbag, t0.general_statusverblijfsobjectbag, t0.general_numberdisplaybag, t0.general_statusnummeraanduidingbag, t0.general_gebruiksdoelbag, t0.general_bagindexpchnrht, t0.general_bagindexwplstreethnrht, t0.general_bagindexwplbagopenbareruimteidhnrht, t0.general_frominitialdataload, t0.general_changedindicator FROM ebx_pdl_address t0 ORDER BY t0.general_addressid; -- This is a primary key
Please, spend a minute or two over it.
Basically, this query is pulling all 43 (why not 42?) columns from all 9+ million rows in that table. At once.
WHERE condition, no filtering
The duration is little more that 128,000 milliseconds. More than 2 minutes!
If a single line weighed a mere 256 bytes (a quarter of a KiB, 6 bytes per columns on average), we would be transferring from the DB to the client more than 2 GiB of data, possibly over an SSL-encrypted channel! If we did that over a 1Gbps ethernet that would require not less than 20 seconds.
A quick check with PgAdmin confirmed me that it is a real table (not a view) and that the column size is about 700 bytes.
Now, let’s think about the software architecture and make some (possibly reasonable) assumption.
We have an RDS, talking to a Java application triggered by a web server. In front of which there is a web load balancer and, finally, the client.
In the case tha paged view of the data set is managed by the Java application itself, then the data is copied over the network only once and cached into the application where the paging happens.
In case the paged view is managed by browser, then there would possibly be two more data copies over the network. I am assuming that internal communication within the Apache stack is shared memory-based, with no copy at all.
I tried to run the same query from a client sitting in the same account as the RDS via the psql command line tool. At best that query required 45 seconds to run. At worst I got the client aborting due to insufficient memory.
I got flabbergasted! I really could not believe my eyes.
The erratic execution delay could have been caused by some application caching. Or even browser caching for what I could know. For sure copying even a single GiB of data from DB to the application seems a näive solution at least.
Moreover I found out that in acceptance there were very few updates to that table, while in production that table was receiving frequent updates. If my caching assumption was true, frequent updates would lead to frequent cache invalidation and new queries to be run.
I briefly discussed this with my colleague involved into this story and wrapped up a conclusion presentation with the above picture attached to it.
Conclusions to the customer
These are the questions the customer asked and the replies I gave during the final session.
Q. Can this problem be solved with a larger RDS instance?
A. From the data I have got I cannot suggest for a larger RDG instance.
Q. Is there any Postgres setting we can tune in order to get better performance?
A. From the data I have got I cannot suggest for configuration tuning that can lead to performance increse better than a few percentage points.
Q. Is there any change that can be applied to the indexes in order to mitigate the problem?
A. No, there is none. The only one in use is the one linked to the primary key (
Q. Do yoiu think we can fix this by switching to <very famous and expensive SQL server>?
A. From the data I have got I cannot suggest such a change as the issue seems to be software architecture-related.
Q. How can we possibly fix this issue?
A. The minimal changes needed to try to fix the issue is to change both the application logics and the query design. The fix requires some software rework and an internal application logic change.
My own conclusions
Acceptance tests should be completely run with positive results before moving to production.
Acceptance tests should include performance tests with a clear and agreeed definition of “expected performance and behavior“.
Näive and careless application and DB design is hard to fix.
Long live to Postgres (and its logs)!