RustProof Labs: blogging for education (logo)

Invalid `pg_dump` file with XML data

By Ryan Lambert -- Published August 25, 2018

Update: This bug is fixed in the latest supported version's minor releases. Upgrade to fix this πŸ› bug!

I ran into a problem when moving a database from a production PostGIS-enabled PostgreSQL server to a development server for testing. Turns out, what I found is a bug in pg_dump related to the XML data type. The problem encountered is filed under bug #15342. Tom Lane summarized the issue:

"There are two problems here: pg_dump neglects to force a safe value of xmloption for the restore step, plus there doesn't seem to be a safe value for it to force :-(."

The rest of this post explores what the problem is, how to tell if you are affected, and your options if you find yourself in this group.

Who does this affect?

This bug affects PostgreSQL databases that contain columns of the XML data type. More specifically, it's a problem if any of the XML data includes a <!DOCTYPE> block. If you have XML data with the <!DOCTYPE> block, and also have true XML fragments, you are really affected. These databases will experience headaches when restoring dump files saved using the pg_dump or pg_dumpall utilities.

The Check for problematic XML data section below provides a query to help determine if your databases are affected.

QGIS Layer Styles

If you have PostGIS databases that support QGIS users and those QGIS users store their styles in the PostGIS database (look at public.layer_styles table), this affects you. I discovered this bug because I am both an admin and analyst user of our PostGIS databases. I use QGIS and I love that I can have my complex styles saved (and backed up!) directly in the database with the spatial data itself.

QGIS achieves this by storing its XML style information in a table named public.layer_styles in the database storing the PostGIS data. Style information is stored in XML format in a column named styleqml that includes a document type declaration (<!DOCTYPE>). As I mentioned earlier, XML data with <!DOCTYPE> is at the core of this problem.

Data to reproduce the bug

The following SQL code will create a table with two columns and three rows of data. These three rows of data are sufficient to replicate the error and illustrate the problem. I have replicated this issue on PostgreSQL versions 9.5 through 11.

CREATE TABLE public.xml_doc (
    notes TEXT NOT NULL,
    data XML NOT NULL
);

-- Example derived from https://wiki.postgresql.org/wiki/XML_Support
INSERT INTO public.xml_doc 
SELECT 'Document, no DOCTYPE', 
    XMLROOT (
  XMLELEMENT (                                       
    NAME gazonk,                                
    XMLATTRIBUTES (
      'val' AS name,
      1 + 1 AS num
    ),
    XMLELEMENT (
      NAME qux,
      'foo'
    )
  ),
  VERSION '1.0',
  STANDALONE YES
)
;

-- Example XML document simplified from:  https://xmlwriter.net/xml_guide/doctype_declaration.shtml
INSERT INTO public.xml_doc 
SELECT 'Document, with DOCTYPE', 
    XMLPARSE (DOCUMENT '<?xml version="1.0" standalone="no" ?>
        <!DOCTYPE document SYSTEM "subjects.dtd">
        <document>
          <title>Subjects available in Mechanical Engineering.</title>
          <subjectID>2.303</subjectID>
        </document>')
;

-- Example dervied from https://wiki.postgresql.org/wiki/XML_Support
INSERT INTO public.xml_doc
SELECT 'Content fragment', 
    XMLPARSE (CONTENT 'abc<foo>bar</foo><bar>foo</bar>');

This example is derived from my original example SQL Fiddle. The SQL Fiddle version will not be maintained/updated.

Data loaded

The code above adds three rows of data with three classifications of XML data.

Only the Document, no DOCTYPE row will always load successfully from a pg_dump backup. The Content fragment will load with the default setting, but fails with the workaround. The Document, with DOCTYPE fails by default and only works with the workaround.

The following query shows the XML data loaded and checks the XML for valid documents.

SELECT notes, data IS DOCUMENT AS xml_document, data
    FROM public.xml_doc;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         notes          β”‚ xml_document β”‚                                           data                                           β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ══════════════════════════════════════════════════════════════════════════════════════════║
β”‚ Document, no DOCTYPE   β”‚ t            β”‚ <?xml version="1.0" standalone="yes"?><gazonk name="val" num="2"><qux>foo</qux></gazonk> β”‚
β”‚ Document, with DOCTYPE β”‚ t            β”‚ <?xml version="1.0" standalone="no"?>                                                   ↡│
β”‚                        β”‚              β”‚                 <!DOCTYPE document SYSTEM "subjects.dtd">                               ↡│
β”‚                        β”‚              β”‚                 <document>                                                              ↡│
β”‚                        β”‚              β”‚                   <title>Subjects available in Mechanical Engineering.</title>          ↡│
β”‚                        β”‚              β”‚                   <subjectID>2.303</subjectID>                                          ↡│
β”‚                        β”‚              β”‚                 </document>                                                              β”‚
β”‚ Content fragment       β”‚ f            β”‚ abc<foo>bar</foo><bar>foo</bar>                                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
(3 rows)

pg_dump and restore

With example data loaded we can take a backup of the database using pg_dump.

pg_dump -d xml_test -f xml_test_invalid.sql

On a second PostgreSQL server, create a new database and attempting to restore the backup using psql:

psql -d postgres -c "CREATE DATABASE restore_here;"
psql -d restore_here -f xml_test_invalid.sql

It starts out so encouraging...

SET
Time: 4.502 ms
SET
Time: 3.987 ms
SET
Time: 4.818 ms
...

But then... ERROR!

ERROR:  invalid XML content
DETAIL:  line 2: StartTag: invalid element name
  <!DOCTYPE document SYSTEM "subjects.dtd">
   ^
CONTEXT:  COPY xml_doc, line 2, column data: "<?xml version="1.0"
standalone="no"?>
  <!DOCTYPE document SYSTEM "subjects.dtd">
  <document>
    <..."

One particularly unfortunate aspect of this bug is that it doesn't throw an error during the pg_dump process. Instead, this bug shows up first when attempting to restore a pg_dump file that includes XML data with the <!DOCTYPE> block.

This illustrates why an untested backup is not a backup! Some problems, like this one, only show up at the tail end of the process.

Check for problematic XML data

There are three steps to see if (or how badly) you are affected.

Database includes XML

The following query will check the active database for any columns with the XML data type. If this query does not return any rows, you do not have XML columns in this database, and don't have to worry about this bug.

SELECT table_schema, table_name, column_name
    FROM information_schema.columns 
    WHERE data_type = 'xml';

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ table_schema β”‚ table_name β”‚ column_name β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ public       β”‚ xml_doc    β”‚ data        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
(1 row)

If rows are returned, make note of the table(s) and column(s), those will be used in the next two queries.

XML includes <!DOCTYPE>

If the prior query returned data, you need to check those columns for XML data including <!DOCTYPE> tags. The following query shows a rough way to check for this by first CASTing the XML column (named data) to TEXT and using the LIKE operator.

A count greater than zero (0) indicates your database has data including the DOCTYPE declaration. This means, at minimum, you need to use the workaround discussed below in order to use pg_dump on that database. If this query returns zero rows, you are not affected by this bug.

SELECT COUNT(*)
    FROM public.xml_doc
    WHERE data::TEXT LIKE '%<!DOCTYPE%';

β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚ count β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚     1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜
(1 row)

When using pg_dumpall means these considerations apply instance wide, not just to a single database.

Do you have XML fragments too?

If you have XML data in your PostgreSQL database including DOCTYPE, you must also check for XML fragment data. The reason why is discussed later.

SELECT COUNT(*)
        FROM public.xml_doc
        WHERE data IS NOT DOCUMENT;

If you have a count greater than zero on this query, and also found XML data with DOCTYPE, you have a big problem with pg_dump. You have bug #15342. We'll come back to this after discussing the workaround if you only have DOCTYPE XML, no fragments.

Workaround for XML with DOCTYPE, but no fragments

If you have XML data with DOCTYPE included in the data (QGIS styles, for example) and no data that is just XML fragments, the workaround is to set an option at the top of the script: SET XML OPTION DOCUMENT; This workaround will work, much of the time (provided no XML fragments...). Experienced PostGIS / QGIS users have known about this issue the workaround for quite a while, it was logged as a QGIS bug back in 2014 that resulted in the QGIS team adding a note about this to their user manual:

"If you want to make a backup of your PostGIS database using the pg_dump and pg_restore commands, and the default layer styles as saved by QGIS fail to restore afterwards, you need to set the XML option to DOCUMENT and the restore will work."

That means add this:

SET XML OPTION DOCUMENT;

One caveat of this workaround is that you must use the plain SQL format for pg_dump. The custom format dump file will not work because you are not able to edit the output to include the XML option.

If this workaround works for you, it's trivial to update your backup commands to start each pg_dump file with the required command.

echo "SET XML OPTION DOCUMENT;" > ~/tmp/database_with_xml.sql
pg_dump -d database_with_xml >> ~/tmp/database_with_xml.sql

PostgreSQL and XML, and Bug #15342

Before going on to the final part of this bug, let's understand a bit more about how PostgreSQL supports XML. I've been lucky enough to mostly avoid XML data, so I had to dive into the PostgreSQL docs to learn more about how XML is handled.

"The xml type can store well-formed β€œdocuments”, as defined by the XML standard, as well as β€œcontent” fragments"

This introduces concepts of XML document and fragments. I had already shown you this concept above in SQL code with IS DOCUMENT and IS NOT DOCUMENT in the queries.

Now refer back to the workaround above, that setting overrides the default setting of CONTENT with the setting DOCUMENT. Though this next statement from the docs seems to indicate the workaround should not be required (emphasis mine):

"The default is CONTENT, so all forms of XML data are allowed."

Wait, if all forms of XML data are allowed in CONTENT as this states, why do we have to use a workaround to override that setting? There is a note further down in the PostgreSQL docs that seem to explain this:

"With the default XML option setting, you cannot directly cast character strings to type xml if they contain a document type declaration, because the definition of XML content fragment does not accept them. If you need to do that, either use XMLPARSE or change the XML option."

DOCUMENT setting and Fragments

The above workaround of adding SET XML OPTION DOCUMENT; was so easy, right? The problem is when you set this option, XML fragments (accepted by the default CONTENT setting) are now rejected. They aren't documents. Re-visiting our query from the beginning, notice Content fragment is False.

SELECT notes, data IS DOCUMENT AS xml_document, data
    FROM public.xml_doc;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         notes          β”‚ xml_document β”‚                                           data                                           β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ══════════════════════════════════════════════════════════════════════════════════════════║
β”‚ Document, no DOCTYPE   β”‚ t            β”‚ <?xml version="1.0" standalone="yes"?><gazonk name="val" num="2"><qux>foo</qux></gazonk> β”‚
β”‚ Document, with DOCTYPE β”‚ t            β”‚ <?xml version="1.0" standalone="no"?>                                                   ↡│
β”‚                        β”‚              β”‚                 <!DOCTYPE document SYSTEM "subjects.dtd">                               ↡│
β”‚                        β”‚              β”‚                 <document>                                                              ↡│
β”‚                        β”‚              β”‚                   <title>Subjects available in Mechanical Engineering.</title>          ↡│
β”‚                        β”‚              β”‚                   <subjectID>2.303</subjectID>                                          ↡│
β”‚                        β”‚              β”‚                 </document>                                                              β”‚
β”‚ Content fragment       β”‚ f            β”‚ abc<foo>bar</foo><bar>foo</bar>                                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
(3 rows)

Trying to restore the database dump now fails on the record that is an XML fragment, instead of failing on the record with the <!DOCTYPE>.

ERROR:  invalid XML document
DETAIL:  line 1: Start tag expected, '<' not found
abc<foo>bar</foo><bar>foo</bar>
^
CONTEXT:  COPY xml_doc, line 3, column data: "abc<foo>bar</foo><bar>foo</bar>"

This illustrates the real problem. Without the workaround, it's impossible to restore XML data that uses <!DOCTYPE>. With the workaround, XML fragments will fail to restore. If you have both...

If your PostgreSQL databases have XML fragments and XML with <!DOCTYPE>, pg_dump will not restore properly.

What to do?

If you have a database that includes XML data making pg_dump unusable, what can you do?

Backups

While pg_dump is a great utility for specific purposes, it should not be relied upon as your main backup process. A tool such as pgbackrest or barman is far better suited for that task. We use pgbackrest at RustProof Labs, and those backups do not suffer from the problem that pg_dump does in this instance. That discussion is far beyond the scope of this post.

If you currently are scripting database backups using pg_dump or pg_dumpall, it's worth the effort to invest in a more reliable backup solution.

Evaluate and standardize

If your database has XML data with DOCTYPE and fragments both, try to evaluate if one of the two XML types can be converted to either a more/less formal XML data format. The answer to if you can do this will be very context specific to your data, systems, users, and software.

For example, our PostGIS databases are used regularly from QGIS, and I love saving our styles directly in the database. Because of this, and the fact that QGIS's style XML includes <!DOCTYPE>, that means having XML fragments would cause a serious problem. Luckily, this is the only XML our databases contain and I don't see any new XML sources coming in anytime soon. In our case, the workaround works.

Split data

If you have both <!DOCTYPE> and XML fragments, and can't convert one to the other, try to split the different data out into different databases at minimum, but different PostgreSQL instances would be preferred. Without splitting that data into different instances, pg_dumpall would be affected even if the XML sources are in different databases.

Summary

If a PostgreSQL database stores a mix of XML data, specifically including records with <!DOCTYPE> blocks, the pg_dump utility can generate invalid database dump files. There is a workaround, but it only works if the database does not include any XML fragments. If you have both <!DOCTYPE> and fragments, pg_dump is going to have a bad time.

I don't have a good idea of how many PostgreSQL databases this affects. The PostGIS community has a large number of users, and many of those databases support QGIS users storing layers in the public.layer_styles table. The workaround should work for many of those users, unless their database also stores fragments of XML as well.

Hopefully this bug is fixed at some point, I will try to update this post to reflect the status when that happens.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published August 25, 2018
Last Updated May 02, 2019