Usually when developing in an application with a database, an API in python for example, you might use an object relational mapping (ORM) library like SQLAlchemy or peewee. This can help develop applications quicker by adding a layer of abstraction over SQL and removing boilerplate code to map database rows and tables to Python objects.

Here we explore unexpected (by me atleast!) behaviour of type conversion by the peewee ORM when working with postgres text arrays and python bytes.

1 Peewee ORM

Typically you’d represent a user in the database with the following schema:

id `(serial4)`	username `(text)`	password_hash `(text)`
1	ismailmo1	$2b…mHyyPsAKm

Using peewee, we can represent the user table with the following model definition:

from peewee import PostgresqlDatabase, Model, CharField
import bcrypt

psql_db = PostgresqlDatabase("postgres", host="localhost", port=5432, user="postgres")


class BaseModel(Model):
    """A base model that will use our Postgresql database"""

    class Meta:
        database = psql_db


class User(BaseModel):
    username = CharField()
    password_hash = CharField()

psql_db.create_tables([User])

We can then create a user with the following:

def hash_pwd(pwd) -> bytes:
    return bcrypt.hashpw(pwd.encode("utf-8"), bcrypt.gensalt())

user = User.create(
    username="test_user",
    password_hash=hash_pwd("password")
)

2 Postgres Arrays

Let’s say due to a new security requirement, we require limiting reuse of old passwords. One way we can acheive this by storing old passwords and checking to make sure the new password isn’t in this list.

Array column types in postgres provide a convenient way to store multiple values for a single row without having to create a foreign key and linking to a new table. We can take advantage of this and redefine our User model to have an extra column: previous_password_hashes of type text[] :

from playhouse.postgres_ext import ArrayField

class UserHighSecurity(BaseModel):
    username = CharField()
    password_hash = CharField()
    previous_password_hashes = ArrayField(CharField, null=True)

psql_db.create_tables([UserHighSecurity])

Now, each time a user changes their password we can add the current password to the array in this field (assume we store all previous passwords for simplicity here):

def change_password(user:UserHighSecurity, new_password:str)->None:
    hashed_password = hash_pwd(new_password)
    user.password_hash=hashed_password
    user.previous_password_hashes.insert(0, user.password_hash)
    user.save()

Testing this out with a new user we can see how this column might work in practice:

# create our user
user_high_sec = UserHighSecurity.create(
                                        username="test", 
                                        password_hash = hash_pwd("pass"), 
                                        previous_password_hashes=[]
                                    )

# change password and add current password to previous_password_hashes
change_password(user_high_sec, "new_password")

Postgres stores in the database with the following schema:

id `(serial4)`	username `(text)`	password_hash `(text)`	previous_password_hashes `(text[])`
1	test	$2b…mHyyPsAKm	{“\x2432622…3414b6d”, “\x2432622…641666d”}

3 Implicit type conversion

We can now see the “problem” - the password_hash column has a very different format to the previous_password_hashes column - even though they were populated with the same value (return from hash_pwd()) and they have the same column type definitions in peewee (and postgres)!

For the password_hash field:

python bytes -> postgres text

For the previous_password_hashes field:

python list[bytes] -> postgres text[]

The python bytes (i.e. the output of hash_pwd() are converted to a string representation before being stored as text in postgres - and surprisingly this is done differently for each column.

For the password_hash column, the string is a utf-8 decoded version of the hash_pwd():

hash_pwd("password").decode("utf-8")

'$2b$12$p5f.cDLtdX0VSPf7FyyTzevbXjBEhJJuXhs0d25uSSZZnWhWygXWK'

Whereas for the previous_password_hashes column, we decode with “hex” (postgres adds \x)

hash_pwd("password").hex()

'2432622431322449785a47644f51426954486a684b666c55646c48767563626b633645565a754e43505372426757322e724a75582f424c386f543243'

By chasing down breakpoints in the debugger and digging through the peewee source code, we can find where this type conversion happens for the password_hash column:

# from peewee.py (comments added by me)
class _StringField(Field):
    def adapt(self, value):
        if isinstance(value, text_type):
            return value
        elif isinstance(value, bytes_type):
             # implicitly decode our password hash bytes to utf-8 string
            return value.decode('utf-8')
            #  ^^^^^^^^^^^^^^^^^^^^^^^^^
        return text_type(value)

This adapt method is called on each column/field during generation of a query when the db_value is accessed:

# from peewee.py
class Field(ColumnBase):
    ...
    def db_value(self, value):
            return value if value is None else self.adapt(value)
...
class Insert(_WriteQuery):
    ...
    def _generate_insert(self, insert, ctx):
        ...
        columns_converters = [
            (column, column.db_value if isinstance(column, Field) else None)
            for column in columns]

Although we can see why bytes are stored as utf-8 in postgres text columns, why is this not the same for list[bytes] and text[]?

Ultimately, ORMs will send a SQL query to the underlying database adapter: for postgres this is psycopg, where we can directly access the cursor.query property during execution:

---  when we call change_password()
UPDATE 
    public.userhighsecurity
SET 
    password_hash='$2b...mHyyPsAKm', -- utf-8 decoded
    previous_password_hashes=array['\x2432622...641666d'::bytea] -- hex representation
WHERE 
    id=1;

So peewee will send a hex string as a postgres bytea type for the python list[bytes] and this is documented in the psycopg docs:

Adaptation of Python values to SQL types

Many standard Python types are adapted into SQL and returned as Python objects when a query is executed.

The following table shows the default mapping between Python and PostgreSQL types:

Python PostgreSQL

buffer memoryview bytearray bytes Buffer protocol bytea

Python	PostgreSQL
`buffer` `memoryview` `bytearray` `bytes` Buffer protocol	bytea

And further info on binary adaptation:

Binary adaptation Python types representing binary objects are converted into PostgreSQL binary string syntax, suitable for bytea fields. … Since version 9.0 PostgreSQL uses by default a new “hex” format to emit bytea fields.

This clears it up! Postgres accepts a hex string as input for bytea fields which is why pyscopg converts our python bytes to hex when it writes our insert query.

4 “Fixing” the issue

For our application, we want to ensure that a user doesn’t reuse their current password (as well as previous ones) so we need to store both passwords consistently to avoid having to convert between formats each time (and for consistency in general).

Since our password hashes are already stored as utf-8 decoded strings, we can decode our previous_hashed_passwords before storing them.

The cleanest way is to change the hash_pwd() to return a string as this will avoid any implicit conversion for the existing password_hash by peewee also - explicit is always better than implicit.

def hash_pwd(pwd) -> str:
    return bcrypt.hashpw(pwd.encode("utf-8"), bcrypt.gensalt()).decode("utf-8")

def change_password(user:UserHighSecurity, new_password:str)->None:
    hashed_password = hash_pwd(new_password)
    user.password_hash=hashed_password
    user.previous_password_hashes.insert(0, user.password_hash)
    user.save()

Now when we add values to previous_password_hashes it will be stored the same way as the password_hash columns

change_password(user_high_sec, "new_password")

Generated sql:

-- all bytes converted to string in python code with .decode()
UPDATE 
    public.userhighsecurity
SET 
    password_hash='$2b...mHyyPsAKm', -- utf-8 decoded
    previous_password_hashes=array['$2b...mHyyPsAKm']  -- utf-8 decoded
WHERE 
    id=1;

Updated database:

id `(serial4)`	username `(text)`	password_hash `(text)`	previous_password_hashes `(text[])`
1	test	$2b…mHyyPsAKm	{“$2b…mHyyPsAKm”}

5 Mixing strings and bytes

While exploring this behaviour and switching between inserting raw bytes into postgres and using .decode() to insert strings, I found that when you insert python bytes into a column with existing text[] something unusual happens:

With the state of the user row as above, when we insert bytes into the array:

hashed_password = bcrypt.hashpw("bytes_password".encode(), bcrypt.gensalt())
user_high_sec.previous_password_hashes.insert(0, hashed_password)
user_high_sec.save()

This generates the following sql:

UPDATE 
    public.userhighsecurity
SET 
    password_hash='$2b...mHyyPsAKm',
    previous_password_hashes=array[
        '\x'2432622...e484471''::bytea, -- new inserted value in bytes
        '$2b...mHyyPsAKm']) -- existing utf-8 decoded 
WHERE 
    id=1;

In this case, postgres converts all elements of the array to the bytea type - although I couldn’t find any documentation on this behaviour (please let me know if you know why this happens!)

Interestingly when you use the array_append method directly in postgres you (rightly) get an error message:

UPDATE
    public.userhighsecurity
SET 
    previous_password_hashes = array_append(previous_password_hashes, '\x'2432622...e484471''::bytea)
WHERE
    id = 1;

SQL Error [42883]: ERROR: function array_append(character varying[], bytea) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts.

Unfortunately when working with an ORM you lose some control of the underlying SQL, which is usually a worthwhile tradeoff, but it can be useful to peek under the hood sometimes.

6 References

blog title image source: “Python and PostgreSQL without ORM,” Moses Gitau (Jun 5, 2018)