from peewee import PostgresqlDatabase, Model, CharField
import bcrypt
= PostgresqlDatabase("postgres", host="localhost", port=5432, user="postgres")
psql_db
class BaseModel(Model):
"""A base model that will use our Postgresql database"""
class Meta:
= psql_db
database
class User(BaseModel):
= CharField()
username = CharField()
password_hash
psql_db.create_tables([User])
Usually when developing in an application with a database, an API in python for example, you might use an object relational mapping (ORM) library like SQLAlchemy or peewee. This can help develop applications quicker by adding a layer of abstraction over SQL and removing boilerplate code to map database rows and tables to Python objects.
Here we explore unexpected (by me atleast!) behaviour of type conversion by the peewee ORM when working with postgres text arrays and python bytes.
1 Peewee ORM
Typically you’d represent a user in the database with the following schema:
id (serial4) |
username (text) |
password_hash (text) |
---|---|---|
1 | ismailmo1 | $2b…mHyyPsAKm |
Using peewee, we can represent the user table with the following model definition:
We can then create a user with the following:
def hash_pwd(pwd) -> bytes:
return bcrypt.hashpw(pwd.encode("utf-8"), bcrypt.gensalt())
= User.create(
user ="test_user",
username=hash_pwd("password")
password_hash )
2 Postgres Arrays
Let’s say due to a new security requirement, we require limiting reuse of old passwords. One way we can acheive this by storing old passwords and checking to make sure the new password isn’t in this list.
Array
column types in postgres provide a convenient way to store multiple values for a single row without having to create a foreign key and linking to a new table. We can take advantage of this and redefine our User
model to have an extra column: previous_password_hashes
of type text[]
:
from playhouse.postgres_ext import ArrayField
class UserHighSecurity(BaseModel):
= CharField()
username = CharField()
password_hash = ArrayField(CharField, null=True)
previous_password_hashes
psql_db.create_tables([UserHighSecurity])
Now, each time a user changes their password we can add the current password to the array in this field (assume we store all previous passwords for simplicity here):
def change_password(user:UserHighSecurity, new_password:str)->None:
= hash_pwd(new_password)
hashed_password =hashed_password
user.password_hash0, user.password_hash)
user.previous_password_hashes.insert( user.save()
Testing this out with a new user we can see how this column might work in practice:
# create our user
= UserHighSecurity.create(
user_high_sec ="test",
username= hash_pwd("pass"),
password_hash =[]
previous_password_hashes
)
# change password and add current password to previous_password_hashes
"new_password") change_password(user_high_sec,
Postgres stores in the database with the following schema:
id (serial4) |
username (text) |
password_hash (text) |
previous_password_hashes (text[]) |
---|---|---|---|
1 | test | $2b…mHyyPsAKm | {“\x2432622…3414b6d”, “\x2432622…641666d”} |
3 Implicit type conversion
We can now see the “problem” - the password_hash
column has a very different format to the previous_password_hashes
column - even though they were populated with the same value (return from hash_pwd()
) and they have the same column type definitions in peewee (and postgres)!
For the password_hash
field:
- python
bytes
-> postgrestext
For the previous_password_hashes
field:
- python
list[bytes]
-> postgrestext[]
The python bytes
(i.e. the output of hash_pwd()
are converted to a string representation before being stored as text
in postgres - and surprisingly this is done differently for each column.
For the password_hash
column, the string is a utf-8
decoded version of the hash_pwd()
:
"password").decode("utf-8") hash_pwd(
'$2b$12$p5f.cDLtdX0VSPf7FyyTzevbXjBEhJJuXhs0d25uSSZZnWhWygXWK'
Whereas for the previous_password_hashes
column, we decode with “hex” (postgres adds \x
)
"password").hex() hash_pwd(
'2432622431322449785a47644f51426954486a684b666c55646c48767563626b633645565a754e43505372426757322e724a75582f424c386f543243'
By chasing down breakpoints in the debugger and digging through the peewee source code, we can find where this type conversion happens for the password_hash
column:
# from peewee.py (comments added by me)
class _StringField(Field):
def adapt(self, value):
if isinstance(value, text_type):
return value
elif isinstance(value, bytes_type):
# implicitly decode our password hash bytes to utf-8 string
return value.decode('utf-8')
# ^^^^^^^^^^^^^^^^^^^^^^^^^
return text_type(value)
This adapt
method is called on each column/field during generation of a query when the db_value
is accessed:
# from peewee.py
class Field(ColumnBase):
...def db_value(self, value):
return value if value is None else self.adapt(value)
...class Insert(_WriteQuery):
...def _generate_insert(self, insert, ctx):
...= [
columns_converters if isinstance(column, Field) else None)
(column, column.db_value for column in columns]
Although we can see why bytes
are stored as utf-8
in postgres text
columns, why is this not the same for list[bytes]
and text[]
?
Ultimately, ORMs will send a SQL query to the underlying database adapter: for postgres this is psycopg
, where we can directly access the cursor.query
property during execution:
--- when we call change_password()
UPDATE
public.userhighsecurity
SET
='$2b...mHyyPsAKm', -- utf-8 decoded
password_hash=array['\x2432622...641666d'::bytea] -- hex representation
previous_password_hashesWHERE
id=1;
So peewee will send a hex string as a postgres bytea
type for the python list[bytes]
and this is documented in the psycopg
docs:
Adaptation of Python values to SQL types
Many standard Python types are adapted into SQL and returned as Python objects when a query is executed.
The following table shows the default mapping between Python and PostgreSQL types:
Python PostgreSQL buffer
memoryview
bytearray
bytes
Buffer protocolbytea
And further info on binary adaptation:
Binary adaptation Python types representing binary objects are converted into PostgreSQL binary string syntax, suitable for bytea fields. … Since version 9.0 PostgreSQL uses by default a new “hex” format to emit bytea fields.
This clears it up! Postgres accepts a hex string as input for bytea
fields which is why pyscopg
converts our python bytes
to hex when it writes our insert query.
4 “Fixing” the issue
For our application, we want to ensure that a user doesn’t reuse their current password (as well as previous ones) so we need to store both passwords consistently to avoid having to convert between formats each time (and for consistency in general).
Since our password hashes are already stored as utf-8
decoded strings, we can decode our previous_hashed_passwords
before storing them.
The cleanest way is to change the hash_pwd()
to return a string as this will avoid any implicit conversion for the existing password_hash
by peewee also - explicit is always better than implicit.
def hash_pwd(pwd) -> str:
return bcrypt.hashpw(pwd.encode("utf-8"), bcrypt.gensalt()).decode("utf-8")
def change_password(user:UserHighSecurity, new_password:str)->None:
= hash_pwd(new_password)
hashed_password =hashed_password
user.password_hash0, user.password_hash)
user.previous_password_hashes.insert( user.save()
Now when we add values to previous_password_hashes
it will be stored the same way as the password_hash
columns
"new_password") change_password(user_high_sec,
Generated sql:
-- all bytes converted to string in python code with .decode()
UPDATE
public.userhighsecurity
SET
='$2b...mHyyPsAKm', -- utf-8 decoded
password_hash=array['$2b...mHyyPsAKm'] -- utf-8 decoded
previous_password_hashesWHERE
id=1;
Updated database:
id (serial4) |
username (text) |
password_hash (text) |
previous_password_hashes (text[]) |
---|---|---|---|
1 | test | $2b…mHyyPsAKm | {“$2b…mHyyPsAKm”} |
5 Mixing strings and bytes
While exploring this behaviour and switching between inserting raw bytes into postgres and using .decode()
to insert strings, I found that when you insert python bytes
into a column with existing text[]
something unusual happens:
With the state of the user row as above, when we insert bytes
into the array:
= bcrypt.hashpw("bytes_password".encode(), bcrypt.gensalt())
hashed_password 0, hashed_password)
user_high_sec.previous_password_hashes.insert( user_high_sec.save()
This generates the following sql:
UPDATE
public.userhighsecurity
SET
='$2b...mHyyPsAKm',
password_hash=array[
previous_password_hashes'\x'2432622...e484471''::bytea, -- new inserted value in bytes
'$2b...mHyyPsAKm']) -- existing utf-8 decoded
WHERE
id=1;
In this case, postgres converts all elements of the array to the bytea
type - although I couldn’t find any documentation on this behaviour (please let me know if you know why this happens!)
Interestingly when you use the array_append
method directly in postgres you (rightly) get an error message:
UPDATE
public.userhighsecurity
SET
= array_append(previous_password_hashes, '\x'2432622...e484471''::bytea)
previous_password_hashes WHERE
id = 1;
SQL Error [42883]: ERROR: function array_append(character varying[], bytea) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts.
Unfortunately when working with an ORM you lose some control of the underlying SQL, which is usually a worthwhile tradeoff, but it can be useful to peek under the hood sometimes.
6 References
- https://www.psycopg.org/docs/usage.html#python-types-adaptation
- https://www.psycopg.org/docs/usage.html#binary-adaptation
- https://www.postgresql.org/docs/16/sql-expressions.html#SQL-SYNTAX-TYPE-CASTS
- https://www.postgresql.org/docs/current/arrays.html
- https://www.postgresql.org/docs/current/datatype-binary.html
blog title image source: “Python and PostgreSQL without ORM,” Moses Gitau (Jun 5, 2018)