Invalid `pg_dump` file with XML data
Update: This bug is fixed in the latest supported version's minor releases. Upgrade to fix this π bug!
I ran into a problem when moving a database from a production PostGIS-enabled PostgreSQL
server to a development server for testing. Turns out, what I found is a bug in pg_dump
related to
the XML
data type. The problem encountered is filed under
bug #15342.
Tom Lane summarized the issue:
"There are two problems here:
pg_dump
neglects to force a safe value ofxmloption
for the restore step, plus there doesn't seem to be a safe value for it to force :-(."
The rest of this post explores what the problem is, how to tell if you are affected, and your options if you find yourself in this group.
- Who does this affect?
- Data to reproduce the bug
- Check for problematic XML data
- Workaround for part of the problem
- PostgreSQL and XML, and bug #15342
- What to do?
Who does this affect?
This bug affects PostgreSQL databases that contain columns of the XML
data type. More specifically,
it's a problem if any of the XML data includes a <!DOCTYPE>
block.
If you have XML data with the <!DOCTYPE>
block, and also have true XML fragments, you are really affected.
These databases will experience headaches when restoring dump files saved using
the pg_dump
or pg_dumpall
utilities.
The Check for problematic XML data section below provides a query to help determine if your databases are affected.
QGIS Layer Styles
If you have PostGIS databases that support QGIS users and those QGIS users store their styles
in the PostGIS database (look at public.layer_styles
table), this affects you.
I discovered this bug because I am both an admin and analyst user of our PostGIS databases.
I use QGIS and I love that I can have my complex styles saved (and backed up!) directly in the
database with the spatial data itself.
QGIS achieves this by storing its XML style information in a table named
public.layer_styles
in the database storing the PostGIS data.
Style information is stored in XML
format in a column named styleqml
that includes a
document type declaration
(<!DOCTYPE>
). As I mentioned earlier, XML data with <!DOCTYPE>
is at the core of this problem.
Data to reproduce the bug
The following SQL code will create a table with two columns and three rows of data. These three rows of data are sufficient to replicate the error and illustrate the problem. I have replicated this issue on PostgreSQL versions 9.5 through 11.
CREATE TABLE public.xml_doc (
notes TEXT NOT NULL,
data XML NOT NULL
);
-- Example derived from https://wiki.postgresql.org/wiki/XML_Support
INSERT INTO public.xml_doc
SELECT 'Document, no DOCTYPE',
XMLROOT (
XMLELEMENT (
NAME gazonk,
XMLATTRIBUTES (
'val' AS name,
1 + 1 AS num
),
XMLELEMENT (
NAME qux,
'foo'
)
),
VERSION '1.0',
STANDALONE YES
)
;
-- Example XML document simplified from: https://xmlwriter.net/xml_guide/doctype_declaration.shtml
INSERT INTO public.xml_doc
SELECT 'Document, with DOCTYPE',
XMLPARSE (DOCUMENT '<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd">
<document>
<title>Subjects available in Mechanical Engineering.</title>
<subjectID>2.303</subjectID>
</document>')
;
-- Example dervied from https://wiki.postgresql.org/wiki/XML_Support
INSERT INTO public.xml_doc
SELECT 'Content fragment',
XMLPARSE (CONTENT 'abc<foo>bar</foo><bar>foo</bar>');
This example is derived from my original example SQL Fiddle. The SQL Fiddle version will not be maintained/updated.
Data loaded
The code above adds three rows of data with three classifications of XML data.
- XML document, no DOCTYPE declaration
- XML document, with DOCTYPE declaration
- XML fragment
Only the Document, no DOCTYPE
row will always load successfully from a pg_dump
backup.
The Content fragment
will load with the default setting, but fails with the workaround.
The Document, with DOCTYPE
fails by default and only works with the workaround.
The following query shows the XML data loaded and checks the XML for valid documents.
SELECT notes, data IS DOCUMENT AS xml_document, data
FROM public.xml_doc;
ββββββββββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β notes β xml_document β data β
ββββββββββββββββββββββββββͺβββββββββββββββͺβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β Document, no DOCTYPE β t β <?xml version="1.0" standalone="yes"?><gazonk name="val" num="2"><qux>foo</qux></gazonk> β
β Document, with DOCTYPE β t β <?xml version="1.0" standalone="no"?> β΅β
β β β <!DOCTYPE document SYSTEM "subjects.dtd"> β΅β
β β β <document> β΅β
β β β <title>Subjects available in Mechanical Engineering.</title> β΅β
β β β <subjectID>2.303</subjectID> β΅β
β β β </document> β
β Content fragment β f β abc<foo>bar</foo><bar>foo</bar> β
ββββββββββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(3 rows)
pg_dump
and restore
With example data loaded we can take a backup of the database using pg_dump
.
pg_dump -d xml_test -f xml_test_invalid.sql
On a second PostgreSQL server, create a new database and attempting to restore the backup using psql
:
psql -d postgres -c "CREATE DATABASE restore_here;"
psql -d restore_here -f xml_test_invalid.sql
It starts out so encouraging...
SET
Time: 4.502 ms
SET
Time: 3.987 ms
SET
Time: 4.818 ms
...
But then... ERROR
!
ERROR: invalid XML content
DETAIL: line 2: StartTag: invalid element name
<!DOCTYPE document SYSTEM "subjects.dtd">
^
CONTEXT: COPY xml_doc, line 2, column data: "<?xml version="1.0"
standalone="no"?>
<!DOCTYPE document SYSTEM "subjects.dtd">
<document>
<..."
One particularly unfortunate aspect of this bug is that it doesn't throw an error during the
pg_dump
process. Instead, this bug shows up first when attempting to restore
a pg_dump
file that includes XML data with the <!DOCTYPE>
block.
This illustrates why an untested backup is not a backup! Some problems, like this one, only show up at the tail end of the process.
Check for problematic XML
data
There are three steps to see if (or how badly) you are affected.
- Database includes XML columns?
- XML data includes
<!DOCTYPE>
- XML data includes fragments
Database includes XML
The following query will check the active database for any columns with the XML
data type.
If this query does not return any rows, you do not have XML
columns in this database,
and don't have to worry about this bug.
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE data_type = 'xml';
ββββββββββββββββ¬βββββββββββββ¬ββββββββββββββ
β table_schema β table_name β column_name β
ββββββββββββββββΌβββββββββββββΌββββββββββββββ€
β public β xml_doc β data β
ββββββββββββββββ΄βββββββββββββ΄ββββββββββββββ
(1 row)
If rows are returned, make note of the table(s) and column(s), those will be used in the next two queries.
XML
includes <!DOCTYPE>
If the prior query returned data, you need to check those columns for XML data including <!DOCTYPE>
tags.
The following query shows a rough way to check for this by first
CAST
ing the XML
column (named data
) to TEXT
and using the LIKE
operator.
A count greater than zero (0) indicates your database has data including the DOCTYPE
declaration.
This means, at minimum, you need to use the workaround discussed below in order to use pg_dump
on that
database. If this query returns zero rows, you are not affected by this bug.
SELECT COUNT(*)
FROM public.xml_doc
WHERE data::TEXT LIKE '%<!DOCTYPE%';
βββββββββ
β count β
βββββββββ€
β 1 β
βββββββββ
(1 row)
When using
pg_dumpall
means these considerations apply instance wide, not just to a single database.
Do you have XML
fragments too?
If you have XML
data in your PostgreSQL database including DOCTYPE
, you must also check for XML
fragment data. The reason why is discussed later.
SELECT COUNT(*)
FROM public.xml_doc
WHERE data IS NOT DOCUMENT;
If you have a count greater than zero on this query, and also found XML data with DOCTYPE
, you have a big problem with pg_dump
.
You have bug #15342.
We'll come back to this after discussing the workaround if you only have DOCTYPE XML, no fragments.
Workaround for XML
with DOCTYPE
, but no fragments
If you have XML data with DOCTYPE included in the data (QGIS styles, for example) and no data that is just
XML fragments, the workaround is
to set an option at the top of the script: SET XML OPTION DOCUMENT;
This workaround will work, much of the time (provided no XML fragments...).
Experienced PostGIS / QGIS users have known about this issue the workaround for quite a while,
it was logged as
a QGIS bug back in 2014
that resulted in the QGIS team adding a note about this to
their user manual:
"If you want to make a backup of your PostGIS database using the pg_dump and pg_restore commands, and the default layer styles as saved by QGIS fail to restore afterwards, you need to set the XML option to DOCUMENT and the restore will work."
That means add this:
SET XML OPTION DOCUMENT;
One caveat of this workaround is that you must use the plain SQL format for pg_dump
. The custom
format dump file will not work because you are not able to edit the output to include the XML option.
If this workaround works for you, it's trivial to update your backup commands to start each pg_dump
file with the required command.
echo "SET XML OPTION DOCUMENT;" > ~/tmp/database_with_xml.sql
pg_dump -d database_with_xml >> ~/tmp/database_with_xml.sql
PostgreSQL and XML, and Bug #15342
Before going on to the final part of this bug, let's understand a bit more about how PostgreSQL supports XML. I've been lucky enough to mostly avoid XML data, so I had to dive into the PostgreSQL docs to learn more about how XML is handled.
"The xml type can store well-formed βdocumentsβ, as defined by the XML standard, as well as βcontentβ fragments"
This introduces concepts of XML document and fragments. I had already shown you this concept
above in SQL code with IS DOCUMENT
and IS NOT DOCUMENT
in the queries.
Now refer back to the workaround above, that setting overrides the default setting of CONTENT
with the setting DOCUMENT
. Though this next statement from the
docs seems to indicate the workaround should not be required (emphasis mine):
"The default is
CONTENT
, so all forms of XML data are allowed."
Wait, if all forms of XML data are allowed in CONTENT
as this states, why do we have to use a
workaround to override that setting? There is a note further down in the PostgreSQL docs that seem to explain this:
"With the default XML option setting, you cannot directly cast character strings to type xml if they contain a document type declaration, because the definition of XML content fragment does not accept them. If you need to do that, either use XMLPARSE or change the XML option."
DOCUMENT
setting and Fragments
The above workaround of adding SET XML OPTION DOCUMENT;
was so easy, right? The problem
is when you set this option, XML fragments (accepted by the default CONTENT
setting) are now rejected.
They aren't documents. Re-visiting our query from the beginning, notice Content fragment
is False
.
SELECT notes, data IS DOCUMENT AS xml_document, data
FROM public.xml_doc;
ββββββββββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β notes β xml_document β data β
ββββββββββββββββββββββββββͺβββββββββββββββͺβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β Document, no DOCTYPE β t β <?xml version="1.0" standalone="yes"?><gazonk name="val" num="2"><qux>foo</qux></gazonk> β
β Document, with DOCTYPE β t β <?xml version="1.0" standalone="no"?> β΅β
β β β <!DOCTYPE document SYSTEM "subjects.dtd"> β΅β
β β β <document> β΅β
β β β <title>Subjects available in Mechanical Engineering.</title> β΅β
β β β <subjectID>2.303</subjectID> β΅β
β β β </document> β
β Content fragment β f β abc<foo>bar</foo><bar>foo</bar> β
ββββββββββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(3 rows)
Trying to restore the database dump now fails on the record that is an XML fragment, instead of failing
on the record with the <!DOCTYPE>
.
ERROR: invalid XML document
DETAIL: line 1: Start tag expected, '<' not found
abc<foo>bar</foo><bar>foo</bar>
^
CONTEXT: COPY xml_doc, line 3, column data: "abc<foo>bar</foo><bar>foo</bar>"
This illustrates the real problem. Without the workaround, it's impossible to restore XML
data that uses <!DOCTYPE>
.
With the workaround, XML fragments will fail to restore. If you have both...
If your PostgreSQL databases have XML fragments and XML with
<!DOCTYPE>
,pg_dump
will not restore properly.
What to do?
If you have a database that includes XML data making pg_dump
unusable, what can you do?
- Implement more reliable backups
- Evaluate and standardize your XML data
- Split databases / instances
Backups
While pg_dump
is a great utility for specific purposes, it should not be relied upon
as your main backup process. A tool such as
pgbackrest
or
barman
is far better suited for that task.
We use pgbackrest
at RustProof Labs, and those backups do not suffer from the problem
that pg_dump
does in this instance. That discussion is far beyond the scope of this post.
If you currently are scripting database backups using
pg_dump
orpg_dumpall
, it's worth the effort to invest in a more reliable backup solution.
Evaluate and standardize
If your database has XML data with DOCTYPE and fragments both, try to evaluate if one of the two XML types can be converted to either a more/less formal XML data format. The answer to if you can do this will be very context specific to your data, systems, users, and software.
For example, our PostGIS databases are used regularly from QGIS, and I love saving our styles directly
in the database. Because of this, and the fact that QGIS's style XML includes <!DOCTYPE>
, that means having
XML fragments would cause a serious problem. Luckily, this is the only XML our databases contain and I don't
see any new XML sources coming in anytime soon. In our case, the workaround works.
Split data
If you have both <!DOCTYPE>
and XML fragments, and can't convert one to the other, try to split the
different data out into different databases at minimum, but different PostgreSQL instances would be preferred.
Without splitting that data into different instances, pg_dumpall
would be affected even if the XML sources
are in different databases.
Summary
If a PostgreSQL database stores a mix of XML data, specifically including records with
<!DOCTYPE>
blocks, the pg_dump
utility can generate invalid database dump files.
There is a workaround, but it only works if the database does not include any XML fragments.
If you have both <!DOCTYPE>
and fragments, pg_dump
is going to have a bad time.
I don't have a good idea of how many PostgreSQL databases this affects. The PostGIS
community has a large number of users, and many of those databases support QGIS
users storing layers in the public.layer_styles
table.
The workaround should work for many of those users, unless their database also stores fragments of XML as well.
Hopefully this bug is fixed at some point, I will try to update this post to reflect the status when that happens.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!