
Temporary functions (kind of) without schema qualifiers

In a previous article of mine I’ve been bitten by the “temporary function issue” (which isn’t an issue at all, of course).

I needed a way to use the now() function in different ways to do a sort of “time travel” over a history table.

I’ve found a few easy™ ways to accomplish the same task and have now distilled the one that seems to me to be the “best one”™. It makes use of a temporary table.

The trick is that a new function, called mynow(), will access a table without an explicit schema qualifier, relying instead on a controlled search_path. This approach opens the door to controlled table masking thanks to the temporary schema each session has. Let’s see the function first.

create or replace function mynow( out ts timestamp )
language plpgsql
as $l0$
begin
  select * into ts from mytime;  -- no schema qualifier: resolution follows the search_path
  if not found then
    ts := now();                 -- fall back to the real current time
  end if;
end;
$l0$;

If you put a timestamp in the table mytime, then that timestamp will be used as the current time. If there’s nothing in there, then the normal now() output will be used.

First of all, you need to create an always-empty non-temporary table like this in the public schema:

create table mytime ( ts timestamp );

So the function will start working right away with the default time line. If I inserted a timestamp straight into that public table, I’d set the function behavior for all sessions at once. But this isn’t normally the objective.
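
For completeness, that global setting would simply be the following (usually not what we want here):

insert into public.mytime values ( '2000-01-01' );  -- affects every session at once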

As soon as a user needs a different reference time for her time travelling, she needs to do the following:

set search_path = pg_temp,"$user", public; -- 1st
create table mytime ( like public.mytime including all ); -- 2nd
insert into mytime values ( '2000-01-01' ); -- 3rd

That’s it. More or less.

The first statement alters the search_path setting so the pg_temp schema becomes the first one to be searched in and, thus, the “default” schema.

The second one creates a temporary table (re-read the previous sentence, please!) “just like the public one”. Please note the schema qualification used with the like clause.

The third one will insert into that temporary table (you re-read it, didn’t you?) a value to be used as the reference time in your time traveling.

Time to test, now. Without closing the current PostgreSQL connection try the following:

select * from mynow();
         ts
---------------------
2000-01-01 00:00:00
(1 row)

If you close the connection and re-open it, the pg_temp schema will vanish and the very same query will behave just like the plain old now() function.

The same behavior can be accomplished without closing the connection, either by deleting from the temporary table or by resetting the search_path to its normal value, thus “unmasking” the persistent table and resetting the function behavior.
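
For example, either of the following restores the default behavior within the same session:

delete from mytime;   -- with pg_temp first in the search_path this empties the temporary table
reset search_path;    -- back to the default value, unmasking the (empty) public.mytime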

If you go back to my article, you can replace the call to now() with mynow() and enable full-blown time travelling over the table.

Of course, there are also alternative implementations and even those with more care in enforcing more controlled behaviors. Feel free to propose your very own.


PostgreSQL and the temporary functions

In my previous article I’ve used the builtin function now() to select from a history table only those rows that are “current“.

A “brilliant next idea” that came to my mind was: “if I can redefine now() in my temporary schema (one for each session) I can browse that table as if I were either in the past or in the future, thus opening my solution to a number of other applications“.

So I’ve tried to go this path.

create or replace temporary function now( out ts timestamp )
volatile language plpgsql as $l0$
begin
  ts = '2017-01-01 00:00:00'::timestamp;
end;
$l0$;

It doesn’t work: there’s no such command as “create temporary function ...” in PostgreSQL!
So I’ve tried a different syntax.

create or replace function pg_temp.now( out ts timestamp )
volatile language plpgsql as $l0$
begin
  ts = '2017-01-01 00:00:00'::timestamp;
end;
$l0$;

This works indeed. Let’s check.

tmp1=# select * from now();
              now              
-------------------------------
 2018-05-30 10:48:19.093572+02
(1 row)

Time: 0,311 ms
tmp1=# select * from pg_temp.now();
         ts          
---------------------
 2017-01-01 00:00:00
(1 row)

Time: 0,790 ms

It works. In order to “mask” the builtin with the temporary one I should change the search_path like this:

tmp1=# show search_path ;
   search_path   
-----------------
 "$user", public
(1 row)

Time: 0,354 ms
tmp1=# set search_path to "$user", pg_temp, public;
SET
Time: 0,382 ms

But …

tmp1=# select * from now();
              now              
-------------------------------
 2018-05-30 10:49:54.077509+02
(1 row)

Time: 0,201 ms
tmp1=# select * from pg_temp.now();
         ts          
---------------------
 2017-01-01 00:00:00
(1 row)

Time: 0,628 ms

It doesn’t work as expected! What? Is this a bug? Of course … it is not!

I’ve found the answer with some searching.
There’s an explicit reference to such a case in the PostgreSQL general mailing list, by Tom Lane himself back in 2008. I quote (with slight edits):

You can [create a temporary function] today, as long as you don’t mind schema-qualifying uses of the function!
That’s intentional because of the risk of trojan horses.

I am not really sure it protects from evil behaviors: an evil temporary table can still mask a good one and change the behavior of an application. And if I wanted to avoid any masking trick involving the search_path, I would always schema-qualify all object references. Wouldn’t I?

Anyway, this means that my trick won’t ever work, not in the way I’ve devised. I have a number of other ideas here involving either the do statement or temporary tables or maybe something else …


History table: my (very own) design pattern

Modern applications often require the ability to track data changes along with the data itself. Think about stock prices, exchange rates and the like. It’s also possible that the concept of “current data” doesn’t even have a meaning within a specific application realm.

For such cases a “history table” can be a good solution, while not the only one. The aim of a history table is to keep a precise track of how a certain set of values has changed over time.

Let’s stick with the stock prices example and start with an initial naive implementation.

create table stockprices (
  stock text primary key,
  price numeric
);

(Sorry, but I cannot really bear uppercase SQL. My fault!)

This simple table can actually record the prices of stocks. But it’s way too simple as we need to know when that price has been updated. Let’s go back to the editor.

create table stockprices (
  stock text primary key,
  price numeric,
  enter timestamp
);

The idea is that every time you update a row with a new price or insert it, you also update the timestamp column to record when the change happened. The primary key is there because the “natural” way to query this table is to use the stock code column to get its current price.

select * from stockprices where stock='GOOGL';

Easy.
But there’s nothing historical here: you just keep the latest value. There’re no old values as after the first insertion you keep updating the very same row. So it’s time to go historical!

First things first. If there has to be a primary key, it won’t be the stock column: we need to allow more rows for the same stock code with different enter timestamps. Here we go!

create table stockprices (
  stock text,
  price numeric,
  enter timestamp
);

Of course we will still use the stock code column to query this table, so we’ll need an index on that column. We also add an index on the timestamp column: the reason will become clear soon.

create index on stockprices( stock );
create index on stockprices( enter desc ); -- look here!

The previous query now becomes:

select distinct on( stock ) *
  from stockprices
  where stock='GOOGL'
  order by stock, enter desc;

Here we sort by stock code and, within each stock code, by reverse enter timestamp (that is, “latest first”). The distinct on clause will pick the first row in each group, thus the one with the latest price. The main advantage here is that the same table works for both the historical data and the current data. There’s actually no difference between older and newer data!

If you think we’re done, you’re wrong. We’re just halfway. Stay tuned then.

When dealing with history tables my experience reminds me about a few facts.

  1. The time a datum enters the table is not always the time at which it is valid. It can enter either earlier (then it’s a forecast or prevision) or later (then it’s a correction).
  2. Corrections can happen, even multiple times, and you need to keep track of all of them.
  3. A datum can stop existing after a certain moment, as if it never existed.
  4. You can have two types of queries. One looking for current values and one looking for values as seen at a certain moment in the past (and sometimes also in the future).

Before going further on, I’d like to remind you a few technical facts that are useful for our purpose.

When you delete rows in a table, no row is actually deleted: they are just marked as “recyclable”. The same happens to the related entries in the indexes.

Similarly, when you update rows in a table, no row is actually updated. Those are “deleted” (see above) and new rows with the updated values are inserted. The same happens to the related entries in the indexes.

While none of this sounds too bad, if deletions and updates happen very often, the table and its indexes get “bloated” with a lot of those “recyclable” entries. Bloated tables are not a big performance issue. Bloated indexes are. In fact, we almost always access rows through indexes, so having “holes” punched into the table storage isn’t a big deal: the index still allows direct row access. The same isn’t true for indexes. The larger they are, the more RAM they need in order to be read from storage and used, and the more I/O you need to perform to load them. Keeping the indexes at their bare minimum storage requirement is a good thing.

This is why most RDBMS have a feature — PostgreSQL calls it “vacuuming” — that compacts both the table storage and the indexes and sweeps those holes away. Vacuuming is a good thing too. It’s a pity the compacting variant (VACUUM FULL in PostgreSQL) cannot really be run while the tables are being actively used: an exclusive table lock is needed in order to perform it. And the larger the table to be vacuumed, the longer that lock will stay active. Vacuuming is therefore a maintenance activity you will tend to perform during off-peak time, or even with no application allowed to access those tables.
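
For reference, the two PostgreSQL commands involved here behave quite differently:

vacuum stockprices;        -- reclaims dead rows for reuse; no exclusive lock, the table stays usable
vacuum full stockprices;   -- rewrites table and indexes compactly; needs an exclusive lock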

History tables help to address this problem. History tables usually don’t have any deletions or updates at all. Just insertions. And search queries, of course. For this reason history tables and their indexes only get new rows “appended”: no hole is punched into the storage and no vacuuming is needed thereafter. It’s a concept very similar to the one used with ledgers for bookkeeping: you always append new data. If you need to correct, you add new data to cancel and yet newer data for the correction. This is also why history tables are well suited for the so-called “big data” applications. Time to get back to our history table implementation.

In a history table, besides the actual data we need (the stock code and its price in our example), we need a few more columns for the housekeeping. A few of these cannot be fully explained at this stage and will become clearer later.

We need to be able to identify every single row unequivocally. A bigserial will do, as its capacity almost reaches 10^19 (billions of billions). This isn’t always needed but, I admit it, I like the idea of having it everywhere. You can skip it at will. I call it id.

We need to know when a certain row has been inserted, no matter what the data validity time is. So we add a timestamp column for that. I call it enter.

We need to know since when a certain row is to be considered effective, so another timestamp is needed. I call it valid.

Finally, we need to be able to say that a certain stock code isn’t used any more, so its latest price isn’t valid any more. I use a bool for that and I call it erase.

Our table structure now does need some makeup.

create table stockprices (
  stock text not null,
  price numeric not null,
  enter timestamp not null default now(),
  valid timestamp not null default '-infinity'::timestamp,
  erase bool not null default false,
  id bigserial primary key
);

I have added a few default values, not really needed in real life, but useful to remember the meaning of some columns.

Also our query needs some makeup, to become a little bit more complex.
We first select the current values for all the available rows.

select distinct on( stock ) *
  from stockprices
  where valid <= now()
  order by stock, valid desc, enter desc;

We will get a row for each single stock code with the latest prices as of now(). We are ordering by reverse (desc) validity and by reverse insertion times. Why?
Let’s imagine we have the stock code ‘GOOGL’ at EUR 400.00 on 09-04-2018. We then discover that the value is wrong: it was actually 401.00 on the very same date. So we insert a new row whose enter is naturally newer than that of the previous one. We can even apply more “corrections” to the very same “row”, all with newer and newer enter timestamps. We keep track of all the values we have entered, while the query will pick up only the latest of the latest. No deletions and no updates.
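
A minimal sketch of that sequence (the defaults fill in enter and erase):

insert into stockprices ( stock, price, valid )
  values ( 'GOOGL', 400.00, '2018-04-09' );   -- original datum
insert into stockprices ( stock, price, valid )
  values ( 'GOOGL', 401.00, '2018-04-09' );   -- correction: same valid, newer enter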

Once we are here, we can then expunge all those rows which have the erase flag set to true.

with x as (
  select distinct on( stock ) *
    from stockprices
    where valid < now()
    order by stock, valid desc, enter desc
)
select * from x where erase is false;

As the now() function returns newer and newer timestamps, the query will also pull into the game all those rows that have been inserted as “future” data, those whose valid column has been set in the future. For example, it can make a certain stock code “disappear” at a certain timestamp, with no maintenance intervention on the table: a row with the column erase set to true and the proper valid timestamp will enter the query row set at the right time and will make that stock disappear!
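
A sketch of such a “scheduled disappearance” (the price is just a filler, since the column is not null):

insert into stockprices ( stock, price, valid, erase )
  values ( 'GOOGL', 0, '2019-01-01', true );  -- GOOGL drops out of the results on 2019-01-01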

But we haven’t finished yet. Why?
The indexes haven’t got their makeup yet. Indexes are why we don’t use plain files!
Everyone should follow a simple rule while designing indexes for DBs. This rule says: “no one knows the DB better than the RDBMS itself”. I’ve learned this lesson at my own expense long ago!

Of course, we can start from a simple index design, usually based on reasonableness. We fill the table with some meaningful data and then ask PostgreSQL to show us how it’s doing with those data and the available indexes. It’s the wonderful world of explain. This command is not (yet) part of the SQL standard and is one of the most interesting PostgreSQL extensions to it. It basically lets you know how the query planner will plan the query, based upon its own optimizations and the usage statistics PostgreSQL keeps continuously up to date. Moreover, there’s the analyze mode, where the query is actually run (right after the planning), showing the actual, real stats. I make extensive use (and so should you!) of this latter mode in order to check how good my DB, index and query design is.

First of all we need to use real-life data. Sort of. You can write a small program to generate an SQL file to populate our table with a number of rows. My personal first test is based on 100k rows involving 500 different stock codes over 365 days, with prices ranging from 0 to 500. Using test data that resembles the real-world ones as much as possible is important: filling a table with a few dozen or a hundred rows doesn’t look to me like “real world data”.

In order to simplify the generation of the data I’ve modified the table so that the column stock is an int instead of text. Creating random numeric stock codes isn’t difficult. Creating random text stock codes is another thing, and I don’t want to waste too much time coding for test data generation. All data will be random, so there can be stock codes that never get into the table: we also need some “not found” results. Don’t we?
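
A minimal sketch of such a generator, written directly in SQL instead of an external program (500 integer stock codes, 365 days, prices from 0 to 500):

insert into stockprices ( stock, price, valid )
select ( random() * 500 )::int,
       round( ( random() * 500 )::numeric, 2 ),
       timestamp '2018-01-01' + random() * 365 * interval '1 day'
  from generate_series( 1, 100000 );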

The table, at the moment, should only have the single index needed for the primary key. It should resemble this one:

tmp1=# \d stockprices
                                       Table "public.stockprices"
 Column |            Type             | Collation | Nullable |                 Default                  
--------+-----------------------------+-----------+----------+------------------------------------------
 stock  | integer                     |           | not null | 
 price  | numeric                     |           | not null | 
 valid  | timestamp without time zone |           | not null | '-infinity'::timestamp without time zone
 enter  | timestamp without time zone |           | not null | now()
 erase  | boolean                     |           | not null | false
 id     | bigint                      |           | not null | nextval('stockprices_id_seq'::regclass)
Indexes:
    "stockprices_pkey" PRIMARY KEY, btree (id)

Our query doesn’t look for anything special. It “just” collapses the whole table to the latest values as of now(). We see it uses all columns but price (which is the column we will be interested in) and id (which is a unique row identification number, useless for most of our needs).

We could then start by creating an index for each of those columns. Let’s do it.

create index on stockprices ( stock );
create index on stockprices ( valid );
create index on stockprices ( enter );
create index on stockprices ( erase );

After a while all indexes will be created. I’d like to point out that if we had created those indexes before loading the data, we would have got a uselessly slow “table population process”: creating them all after the load saves a lot of time. More on this later.
We can now ask our command line tool to show some timing information for query execution:

\timing

I actually have this command in my ~/.psqlrc file so it’s always on by default.

It’s time to fire our query and see how it will be performed by PostgreSQL.

explain analyze with x as (
  select distinct on( stock ) *
    from stockprices
    where valid < now()
    order by stock, valid desc, enter desc -- Sort on 3 columns
)
select * from x where erase is false;

This is output on my system:

tmp1-# select * from x where erase is false;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on x  (cost=13976.82..13986.82 rows=250 width=61) (actual time=84.501..99.082 rows=500 loops=1)
   Filter: (erase IS FALSE)
   CTE x
     ->  Unique  (cost=13476.82..13976.82 rows=500 width=33) (actual time=84.497..99.015 rows=500 loops=1)
           ->  Sort  (cost=13476.82..13726.82 rows=100000 width=33) (actual time=84.496..94.757 rows=100000 loops=1)
                 Sort Key: stockprices.stock, stockprices.valid DESC, stockprices.enter DESC
                 Sort Method: external merge  Disk: 5680kB
                 ->  Seq Scan on stockprices  (cost=0.00..2435.00 rows=100000 width=33) (actual time=0.016..23.901 rows=100000 loops=1)
                       Filter: (valid < now())
 Planning time: 0.745 ms
 Execution time: 99.771 ms
(11 rows)

Time: 101,181 ms

In order to make life easier for DBAs, there’s a neat and great online helper to better understand it, written and hosted by Hubert ‘depesz’ Lubaczewski. It’s called … uh … explain. You simply paste your explain output there and you get back a more readable format.

From the bottom up, I see on line #5 a sequential table scan (which isn’t a nice thing at all!) to select only those rows that are not in the future. At line #4 the sort over three columns is run, and then on line #3 the rows are “squashed” to be made unique. At line #1 the test on the flag column is run to remove all rows which have been erased.

As a first attempt at doing better, I create a single index for that sort and leave the one on the erase column.

drop index stockprices_enter_idx;
drop index stockprices_valid_idx;
drop index stockprices_stock_idx;
create index on stockprices( stock, valid desc, enter desc );

Then I reboot my system (to be sure all disk caches are gone) and try the query again:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on x  (cost=7736.16..7746.16 rows=250 width=61) (actual time=0.115..48.671 rows=500 loops=1)
   Filter: (erase IS FALSE)
   CTE x
     ->  Unique  (cost=0.42..7736.16 rows=500 width=33) (actual time=0.109..48.513 rows=500 loops=1)
           ->  Index Scan using stockprices_stock_valid_enter_idx on stockprices  (cost=0.42..7486.18 rows=99990 width=33) (actual time=0.107..42.855 rows=100000 loops=1)
                 Index Cond: (valid < now())
 Planning time: 0.486 ms
 Execution time: 48.750 ms
(8 rows)

Time: 49,875 ms

Bingo! I’ve cut the time in half.
The table sequential scan is gone, replaced by an index scan plus a condition over a single index. Scanning an index should be much better than scanning a table. Shouldn’t it? Moreover, updating a single multi-column index costs less than updating multiple single-column indexes and can yield some important advantages during queries.

It’d be nice to squeeze some more time out of this query but I’ll leave that to the keen reader. I’ll save you some time, though: it’s useless (as of v10.4) to add the filter condition erase is false to the unified index (or to create a new one). That condition is run over the CTE (common table expression, a temporary un-indexable result set) we’ve called x.

For sure we can ditch the index on the flag, as it’s not used at all and would just cause more work on the RDBMS during insertions.
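
Assuming the default name PostgreSQL generated for it, that is simply:

drop index stockprices_erase_idx;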

Of course we haven’t tried a complete query yet, the one where you look for a specific stock code. The query could simply be:

explain with x as (
  select distinct on( stock ) *
    from stockprices
    where valid < now()
    order by stock, valid desc, enter desc
)
select * from x where erase is false and stock=42; -- Test here? No!

I won’t even try it. The condition on the stock code would be sequentially applied to the CTE, just like the one on the flag. It wouldn’t take advantage of any index!

The right way to do it is to do the selection within the CTE, like here:

explain analyze with x as (
  select distinct on( stock ) *
    from stockprices
    where valid < now() and stock=142 -- The test is better here!
    order by stock, valid desc, enter desc
)
select * from x where erase is false;

The results are different. Just a little bit:

                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on x  (cost=501.03..504.31 rows=82 width=61) (actual time=0.840..0.895 rows=1 loops=1)
   Filter: (erase IS FALSE)
   CTE x
     ->  Unique  (cost=500.54..501.03 rows=164 width=33) (actual time=0.834..0.888 rows=1 loops=1)
           ->  Sort  (cost=500.54..501.03 rows=198 width=33) (actual time=0.833..0.862 rows=215 loops=1)
                 Sort Key: stockprices.valid DESC, stockprices.enter DESC
                 Sort Method: quicksort  Memory: 41kB
                 ->  Bitmap Heap Scan on stockprices  (cost=6.45..492.98 rows=198 width=33) (actual time=0.210..0.670 rows=215 loops=1)
                        Recheck Cond: ((stock = 142) AND (valid < now()))
                        ->  Bitmap Index Scan on stockprices_stock_valid_enter_idx  (cost=0.00..6.40 rows=198 width=0) (actual time=0.143..0.143 rows=215 loops=1)
                             Index Cond: ((stock = 142) AND (valid < now()))
 Planning time: 0.314 ms
 Execution time: 0.966 ms
(14 rows)

Time: 2,070 ms

We are almost done. We have a query to create a view of “current values”, we have a query to select the current value for a single stock code.

What we’re missing is a query for a set of stock codes. Of course, I am not going to make use of predicates like ... where (stock=42 or stock=142 or stock=242). First, that would require you to compose a dynamic query (which isn’t a bad thing at all in general, while still error prone). Second, it just multiplies the query (or a part of it) by the number of different stock codes you are looking for: if they’re 100, the query will likely be repeated 100 times. So what?

The answer is INNER JOIN.

Let’s imagine we have a generic tool table we use for selections. This table can be used by multiple users (and for multiple purposes) and still keep track of all the selections that have been made.

create table stockselection (
  selid bigserial not null,
  stock int not null,
  primary key( selid,stock )
);

This is how it works. You insert the first stock code to search for like this:

insert into stockselection ( stock ) values ( 42 ) returning selid;
 selid 
-------
   666
(1 row)

INSERT 0 1

So you get the newly created value for the selid column. That’s the “selection id” that allows multiple selection queries to be run by “just” using different selection ids.
Then you insert the remaining stock codes using that very same selection id with queries like these:

insert into stockselection values ( 666, 142 );
insert into stockselection values ( 666, 242 );
...
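
Equivalently, the remaining codes can be added in a single statement (same selection id as above):

insert into stockselection ( selid, stock ) values ( 666, 142 ), ( 666, 242 );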

The query with the stock selection feature will become:

explain with x as (
  select distinct on( stock ) *
    from stockprices
    natural inner join stockselection -- inner join
    where valid < now() and selid=666 -- selection id
    order by stock, valid desc, enter desc
)
select * from x where erase is false;

This is it. More or less. Of course, there’s plenty of room for further enhancements: this article is meant to show ideas and give hints, not boxed solutions. Some more work may be needed depending upon your very own case.

Enjoy your coding.


networked tar

tar + ssh = powerful tool

Intro

The tar command is, in my opinion, one of the most widely misused and underestimated tools in the Linux/Unix world. And its man pages don’t help much! But think: almost all software package formats use tar to keep all the needed stuff in a single file. There must be some reason!

This tool basically stores into a single file, a tar file or a tar archive, a number of files along with their meta-data, like file names, last change time and permissions. This is very useful to move files from one system to another one.

Of course I am not going to discuss its basic operations and the like. There’s plenty of documentation for that on the GNU web site. What I’d like to share with you all is some experience I’ve gained while using this tool in a networked environment. The power coming from shell pipes is at work here!

Prerequisites

For proper file access on both the local and the remote systems a few pieces of software need to be properly installed and configured. I will use ssh for network operations and sudo for privilege escalation (in case you need to read or write privileged files). The man pages are the sysop’s best friend. In particular, with ssh you may need to put some “ssh key authentication” configuration in place.

Objectives and scenarios

I want to avoid storing a tar archive where it’s not needed. So, for example, if I need to archive some files from host A to host B, I keep the tar on host B and would like not to store it on host A. Not even temporarily. The reasons can be many but they all boil down to this: I don’t have enough storage room on host A.

There’s also the case in which you don’t store the archive on either end and you just exploit tar for its archive format and features. This is the “system-to-system copy” scenario that I’ll leave as the last (but not least) one.

The possible (interesting) scenarios are basically two:

  1. archiving local files into a remote system
  2. archiving remote files into a local system

We can “explode” these two with some extra variables like data compression and reverse operations like unarchiving (or archive extraction).

In any case, to help the discussion and to ease the operation, I’ll setup three shell variables:

TARGET="/etc /var/spool/cron/crontabs /root/.ssh"
REMOTE="user@remote"
ARCHIV="/tmp/system-\$(date '+%Y%m%d%H%M%S').tar"

The first variable (TARGET) defines the directories and the files to be archived. Be careful with characters that need to be escaped, like blanks, dollar signs and the likes.

The second one (REMOTE) defines the credentials to be used to reach the remote system by ssh.

The third one (ARCHIV) defines the path and the archive naming convention. Please pay attention to the “escaped dollar”: that command will be run (and will provide data) only when the variable is used. Skip that escape and it will run at definition time.
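
A quick way to check the difference from the very same shell (just a sketch):

echo "${ARCHIV}"         # prints the literal $(date ...) command, still unexpanded
eval echo "${ARCHIV}"    # forces the expansion now, as the remote shell will do later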

Remote archive

In this scenario, the files and the directories to be archived are local, while the archive itself will be stored on the remote system.

sudo tar cf - ${TARGET} | ssh ${REMOTE} "cat > ${ARCHIV}"

Let’s see how it works.

  1. We ask sudo to escalate the privileges so the subsequent command (tar) can properly access files that aren’t readable and directories that aren’t listable to a normal user.
  2. We ask tar not to produce a tar file but to stream the archive to its standard output, thanks to the file argument “-”. The stuff to be archived is what’s in the $TARGET variable.
  3. We pipe (“|”) that output to the standard input of an ssh connection to the $REMOTE system.
  4. On the remote system we ask the local shell to run the cat program that will read from standard input (coming from ssh) and write to the standard output.
  5. On the remote system that standard output will be redirected to a file defined by the variable $ARCHIV. Note that the “\$” will be unescaped only at this stage.

Local archive

In this scenario, the files and the directories to be archived are on the remote system, while the archive itself will be stored locally.

ssh ${REMOTE} "sudo tar cf - ${TARGET}" | cat > ${ARCHIVE}

Let’s see how it works.

  1. We open an ssh session to the $REMOTE system.
  2. On that system we ask sudo to escalate the privileges for the subsequent command (tar). The reasons are the same as for step #1 in the previous scenario.
  3. We ask tar the same we asked in the previous scenario, but on the $REMOTE system.
  4. The remote tar file will be streamed to the standard output of the ssh client.
  5. We pipe that output to the standard input of the local command cat.
  6. Its output is then redirected to a local file defined by $ARCHIV.

Compression?

We have two options to define where the compression is to happen (locally or remotely) and three different compression types (gzip, bzip2 and xz). And of course the two scenarios. But we’ll show only one compression type (xz) in the remote archive scenario. “We leave the other cases to be expanded as an exercise for the student”.

In order to have the compression done locally, the original command will be changed like this:

sudo tar cJf - ${TARGET} | ssh ${REMOTE} "cat > ${ARCHIV}.xz"

The “J” switch enables the xz compression in the tar tool, “z” will make use of the popular gzip and “j” will enable bzip2. The data will be sent compressed over the network for the storage on the remote system.

With the other option, the compression done remotely, the original command will be changed like this:

sudo tar cf - ${TARGET} | ssh ${REMOTE} "cat | xz -9 -c - > ${ARCHIV}.xz"

Here, the archive is sent uncompressed over the network and piped through the xz compression tool just before being stored on the remote system.

The reasons behind the choice can be different. But basically the system that does the compression will bear the workload.

There’s also a third option for compression: the ssh transparent compression. In case both ends support it (all modern versions of ssh do), the data from the ssh client will be compressed before being transmitted to the server, which in turn will uncompress it “on the fly” (thus the transparency). The tar archive will thus be stored uncompressed. This is achieved with the option “-C” to the ssh client.

sudo tar cf - ${TARGET} | ssh -C ${REMOTE} "cat > ${ARCHIV}"

Copying

As promised, I will show how to copy directories and files between two systems without storing any archive on either end. But, as I am lazy, I’ll show one case and leave all the other ones to the keen reader.

sudo tar cf - ${TARGET} | ssh ${REMOTE} "sudo tar xf - -C /"

This copies local files and directories (and all of their meta-data) to the remote system. As you can see, there’s no mention of the $ARCHIV variable, as there’s no archive to be stored at all. The basic concept is that a local tar is piped to a remote one through an “ssh channel”.

We also have a few final bits here.

  1. When you run tar for unarchiving (with the “x” switch), you need to specify where the files and the directories need to be copied. This is what the “-C /” is for: start from the root directory. This is needed especially when the files are archived with absolute paths.
  2. While for archiving with compression you need to specify the compression type, you don’t need to specify the uncompression type: tar is smart enough to understand it on its own.
  3. Be really, really careful with unarchiving, especially when using sudo. File overwrite in Linux/Unix is unforgiving!

Happy tarring!

 

Thoughts on partitioning in Postgres

Big data. A buzzword echoing everywhere nowadays, more or less like “AI”. Big data actually means “a real lot of data”. The concept is that a company can (try to) extract extra business information by analyzing a real lot of data coming, for example, from an ERP.

Statistics is one type of information. Think for example about seasonal product sales and buyers behaviors.

For such reasons almost all DBMSs have started supporting effective ways to handle “really large”™ data tables. But is this really needed? Aren’t indexes meant to cope with that? The answer is not that straightforward and could be summarized as “yes and no”. Let’s try to understand better, also because, under some circumstances, table partitioning is supposed to provide faster queries too. And we really like getting faster queries, especially when that wasn’t our main objective!

Table partitioning means that a very large table is split into a number of non-overlapping sub-tables (called partitions), each one with its own indexes, but all seen as a single table. Postgres has provided a way to implement this by means of table inheritance and table constraints for quite some time now, and since v10 it provides an easier way with declarative partitioning.

The way partitioning works is more or less the same as with the old table-inheritance trick, with the declarative implementation being currently little more than syntactic sugar. You have an inheritance tree of tables to query them as a whole and a not-so-trivial trigger to direct insertions to the right sub-tables. (In movement tables you don’t have deletions and updates!)

The partitioning is done according to a certain criterion that obviously needs to be set up in the table constraint expressions and known to the mentioned trigger function. The declarative partitioning hides all those details behind a simpler syntax. That’s it, I’d say.
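
As a quick illustration, this is what the v10 declarative syntax looks like (table and column names are made up for the example):

create table moves (
  item     int not null,
  moved_at date not null,
  qty      int not null
) partition by range ( moved_at );

create table moves_2018_w22 partition of moves
  for values from ( '2018-05-28' ) to ( '2018-06-04' );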

The key point in partitioning performance lies, in my opinion, in the algorithm that performs the partition selection, which is currently linear in complexity. This means that if you go from 10 to 100 partitions, the amount of resources needed will grow tenfold. While with a dozen partitions the toll you pay for that algorithm can be negligible, close to zero, with hundreds or thousands it will grow tens or hundreds of times larger and may not be negligible any more. But keep in mind that the current algorithm hasn’t been designed for partitioning, so I see a lot of room for improvement there.

So what? Should we delay the use of table partitioning? I would say so.

I see a possible approach while waiting for a major rework of that algorithm.

As all queries (both insertions and searches) are issued by software, the sub-table selection can be done in software as well. I call this approach “software-aware partitioning”. Let’s make an example to try to be clearer. I have a table to record every single item move in and out of a set of stocks. I have 20 stocks and 50K different types of items, for hundreds of millions of moves a year. And let’s say I keep a couple of years of history.

I can split those data into weekly partitions (52 per year) and into item groups of 10K items based upon the last 4 digits of their “internal numeric code”™. I’ll end up with about 260 partitions of about 1.5 million rows each. I am assuming here that this partitioning schema makes a lot of sense with the query logic I need to implement. But this holds true for whatever partitioning technology you’ll use.

Instead of relying on whatever implementation I get from my version of Postgres, I move the sub-table selection into software by doing very basic maths on the “date of the year” and “internal numeric code” columns (basically a couple of modulo operations) and by dynamically creating the query to be sent to Postgres. By doing so I have moved the logic (and the algorithm) from the generality of Postgres to the specialization of my software. In the end, who knows better than your software which data need to be searched for or inserted? Of course, we could need major software rework in case big changes are needed in the partitioning schema. And, of course, some more programming work is needed if the data is spread among a number of large tables. But, again, this holds true for whatever … you know.
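
A minimal sketch of the idea with made-up names, written here as a plpgsql block for brevity (in real life the routing would live in the application code, and the sub-tables are assumed to exist already):

do $sketch$
declare
  move_date date := '2018-05-30';
  item_code int  := 123456;
  tbl       text;
begin
  -- pick the sub-table from the ISO week and an item-group modulo
  tbl := format( 'moves_%s_g%s',
                 to_char( move_date, 'IYYY_IW' ),
                 ( item_code % 10000 ) / 2000 );   -- 5 illustrative item groups
  execute format( 'insert into %I ( item, moved_at, qty ) values ( $1, $2, $3 )', tbl )
    using item_code, move_date, 10;
end;
$sketch$;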

But the current implementations in Postgres do allow for an easy disposal of unwanted portions of those big data: a simple DROP TABLE on an old partition will suffice, with a very tiny impact. Maybe it’s just me, but I don’t see this as a good enough reason to embrace partitioning at this (early?) stage.
Let’s see what the future will bring to us.

SSH through a SOCKS proxy

(Almost) Everyone knows about the SOCKS proxy capabilities provided by SSH. It’s that -D option you read about in the friendly man page. But it’s not a (usual) HTTP proxy: it’s a SOCKS proxy. Early mandatory big warning: this article applies to a “rather recent”™ OpenSSH release. With other releases and implementations YMMV!

I am sure you have found this in all browsers’ proxy configuration section and am also sure just a few of you have dared to use it. The concept is very similar to the more commonly used HTTP proxy, only more general: it works with all TCP protocols (and provides a means for UDP support as well).

But this article (like many similar ones) isn’t about how to establish a SOCKS proxy with SSH. You have that friendly man page for that as well as lots of web pages documenting and tutoring about that.

This article is about establishing an SSH connection by using a SOCKS proxy established by another SSH connection. It seems a corner case, but I’ll explain later why it isn’t. The main reason for this article is that SSH doesn’t provide a way to use a SOCKS proxy directly (otherwise this article would be a real waste of web space). One step at a time.

Let’s start with the first ssh connection, the one establishing the SOCKS proxy:

ssh -T -D 127.1.2.3:1080 username@reachhost

Here I have chosen to bind the proxy to one of the addresses of the loopback, so I can bind more than one proxy on the same client. The TCP port number is 1080, the default one. Finally, reachhost is the name (or address) of the proxying machine. The -T option prevents a pseudo-terminal from being allocated on the proxying machine. With proper automatic authentication mechanisms in place (like SSH keys), the above command can even be run in the background.

We now aim at reaching by SSH a server which is reachable from reachhost but not from our client. There can be many reasons for that, but they are not relevant now. Let’s call that other server hiddenhost.

In order to be able to accomplish what we are trying to do we need a little helper.

It’s a small but useful piece of software found on almost all Unix/Linux systems. It’s called netcat and is often dubbed the “TCP/IP swiss army knife”. The basic idea of netcat is that it connects its stdin and stdout to a TCP (or UDP) socket, as either client (connecting) or server (listening).

This tool exists in at least three different variants:

  1. the (more or less) original one by a developer known as “*Hobbit*”;
  2. the complete rewrite done within the GNU project;
  3. a variant bundled with the nmap scanning tool, called ncat.

There are, very likely, many other similar tools that can be exploited for the same purpose. I’ll leave those as an exercise for the keen students.

The SSH client has an option to describe how to start a generic proxy before actually starting the whole SSH protocol. The option is ProxyCommand and it makes the SSH client start that command and use its stdin and stdout to send and receive all the traffic. That option also supports some “metacharacters” (“tokens” in SSH lingo) to make the feature really flexible. Thank you, OpenSSH team!

Back to netcat now. Netcat (and the listed variants) all have the ability to connect to a proxy, SOCKS v5 by default (but SOCKS v4 or HTTP can also be used). Let’s stick with our objective, that’s SOCKS v5.

We then need to start (another) SSH client directed to the unreachable host we called hiddenhost. Here it comes!

ssh -o ProxyCommand="nc -x 127.1.2.3:1080 %h %p" hiddenhost

Nota bene: the double quotes are needed because of the blanks inside the command. If you don’t have any shell variable there, you can also use single quotes.

Let’s split and explain the options for the nc command set as the ProxyCommand option value:

  • -x 127.1.2.3:1080 : connect to the SOCKS v5 server (established by the previous ssh command) that’s on port 1080 of 127.1.2.3
  • %h is the token that will evaluate to the actual host to be connected to (hiddenhost)
  • %p is the token that will evaluate to the actual TCP port to be connected to (22 by default).

This client will first start the nc tool to reach the locally running SOCKS proxy and will then ask it to connect to hiddenhost on TCP port no. 22. Upon success, the client will send and receive all of its traffic through the usual Unix pipes connected to the nc process.

If you are relying on ncat the command will read as:

ssh -o ProxyCommand="ncat --proxy 127.1.2.3:1080 %h %p" hiddenhost

with the same semantics and results.
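
The same setup can be made permanent in ~/.ssh/config (a sketch, using the same addresses as above):

Host hiddenhost
    ProxyCommand nc -x 127.1.2.3:1080 %h %p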

The possibilities are many more than those shown here: this article is aimed at showing the basics. The friendly man pages for both ssh and nc are there to show you the network side of the Force.

NUMERIC vs REAL in Postgres

precision vs speed … or not?

Intro

I admit it: I’ve never used the REAL data type in my DBs. Only NUMERIC. Both data types store a representation of real numbers, but while NUMERIC stores a complete and precise representation, REAL uses a CPU-native approximated representation, also known by the name of the standard describing it: IEEE 754.

The differences are basically two:

  1. The CPU-native representation uses a fixed amount of bytes (4 bytes for single precision and 8 bytes for double precision) while NUMERIC uses a number of bytes that depends upon the needed precision.
  2. The CPU-native representation can exploit hardware (CPU built-in) instructions for operations, while NUMERIC requires software implementation for the same operations.

IEEE 754 representations, moreover, use binary formats while humans prefer base 10. This means that the fractional parts of many numbers have an approximated representation, even ones with a single decimal digit. For example, 1/10, which NUMERIC stores exactly as “0.1”, becomes the repeating “0.000110011…” in binary.
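
A quick way to see the difference from psql (what matters here is the comparison, not the printed digits):

SELECT 0.1::NUMERIC + 0.2::NUMERIC = 0.3::NUMERIC;                                -- true: exact decimal arithmetic
SELECT 0.1::DOUBLE PRECISION + 0.2::DOUBLE PRECISION = 0.3::DOUBLE PRECISION;     -- false: binary rounding errors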

No one wants an approximated representation when storing ERP data, even at a higher cost in terms of CPU resources. Economic data needs to be precise, while scientific data can accept some degree of approximation.

But how high is this cost?

How I tested

A real test (pun is real-ly intentional here!) would require an ERP software to be written to use either IEEE 754 or NUMERIC format and to be run with precise auditing code in order to show the differences, if any.

You know by now that I am a little bit lazy, so I won’t follow this path. What I am doing instead is stressing the software implementation used by Postgres to sum NUMERICs against the hardware implementation.

My test is being run on a 12-thread Intel® Xeon® E5-1650 at 3.20GHz equipped with 64GB RAM, running Ubuntu Linux Server 16.04.3 and Postgres 10.1.

I created three different tables

CREATE TABLE n ( n NUMERIC );
CREATE TABLE r ( r REAL );
CREATE TABLE d ( d DOUBLE PRECISION );

and loaded each of them with 10 million random real numbers

INSERT INTO n SELECT random() * 10000 FROM generate_series(1, 10000000);
INSERT INTO r SELECT random() * 10000 FROM generate_series(1, 10000000);
INSERT INTO d SELECT random() * 10000 FROM generate_series(1, 10000000);

Then I ran a simple query on each table several times in order to get those data cached in RAM and avoid the cost of the I/O from the disks:

SELECT sum(n) FROM n;
SELECT sum(r) FROM r;
SELECT sum(d) FROM d;

The tests

After enabling timing with “\timing” in the psql client, I ran those three queries 10 times. The results are rather interesting:

...
tmp=# SELECT sum(n) FROM n; 
              sum                
-------------------------------- 
 49986687209.124667565752075342 
(1 row) 
 
Time: 536,900 ms 
tmp=# SELECT sum(r) FROM r; 
     sum      
------------- 
 4.99749e+10 
(1 row) 
 
Time: 301,231 ms 
tmp=# SELECT sum(d) FROM d; 
       sum         
------------------ 
 50011771973.0243 
(1 row) 
 
Time: 320,988 ms 
tmp=# SELECT sum(n) FROM n; 
              sum                
-------------------------------- 
 49986687209.124667565752075342 
(1 row) 
 
Time: 530,584 ms 
tmp=# SELECT sum(r) FROM r; 
     sum      
------------- 
 4.99749e+10 
(1 row) 
 
Time: 296,693 ms 
tmp=# SELECT sum(d) FROM d; 
       sum         
------------------ 
 50011771973.0242 
(1 row) 
 
Time: 325,111 ms

The results

Just a few considerations of mine.

  1. Timing differences between REAL and DOUBLE PRECISION are negligible. Very likely the floating point units (FPUs) use double precision for the operations and honor the actual data size only for the I/O between memory and the registers.
  2. NUMERIC (software) seems to be a little less than 2x slower than IEEE 754 (hardware). Part of this difference is due to the software implementation of the sum() function and part is due to the I/O needed for variable-length NUMERIC columns.
  3. NUMERIC, as tested here, is not being used at high precision, as the random() function provided by Postgres only produces DOUBLE PRECISION values. This is good, as the test domain for NUMERIC and DOUBLE PRECISION is pretty much the same.

This test is far from being representative of real-life situations. In my opinion, though, it highlights what could be an upper limit to the difference between native and non-native operations. This should mean that in real-life situations, where numeric computation eats only a fraction of all the CPU time spent in Postgres, the differences should tend towards zero, thus making the NUMERIC data type the best candidate for ERP applications.

A pro for the IEEE 754 representation (that is, the DOUBLE PRECISION data type) against the one used by NUMERIC is the constant storage size needed and, due to the actual sizes, also more effective I/O in general.