dump metadata along with document fragment #2019

Closed
zackw opened this Issue · 18 comments

6 participants

@zackw

I am looking for a way to get pandoc to dump out metadata along with a document fragment. This should behave as follows:

  • There should be some unambiguous syntactic mechanism to distinguish the metadata from the document fragment. I don't care what that is.
  • All metadata should be dumped, not just what is meaningful to pandoc itself or a template.
  • It should be dumped out as JSON, faithfully reproducing whatever data structure was read in (for instance, YAML lists should be preserved)
  • Strings should be rendered using the same markup as the document fragment.

An example would probably help: given

---
authors: [joe bloggs, fred mbogo]
title: This title contains *emphasis* and $m$-ath
...

This is the body of the document

pandoc -t html5+metadata --mathml should produce something like

<!-- metadata:
{"authors":["joe bloggs","fred mbogo"],"title":"This title contains <em>emphasis</em> and <math><mrow><mi>m</mi></mrow></math>-ath"}
:metadata -->
<p>This is the body of the document</p>

It's quite possible that there's already a way to do something like this and I just can't find it, in which case I would appreciate a pointer.

A way to dump only the metadata, but still applying a rendering, would also be useful.
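
For a consumer, the wrapper proposed above would be easy to take apart. A minimal sketch, assuming the `<!-- metadata: ... :metadata -->` markers from the example (they are only a proposal in this request, not something pandoc emits) and a hypothetical helper named `split_metadata`:

```python
import json
import re

# Markers as proposed in the example above; they are hypothetical,
# not an existing pandoc output format.
_META_RE = re.compile(r"\A<!-- metadata:\n(.*?)\n:metadata -->\n?", re.DOTALL)


def split_metadata(output):
    """Split writer output into (metadata dict, document fragment)."""
    match = _META_RE.match(output)
    if match is None:
        # No metadata block found: treat everything as the fragment.
        return {}, output
    return json.loads(match.group(1)), output[match.end():]


sample = (
    "<!-- metadata:\n"
    '{"authors":["joe bloggs","fred mbogo"],"title":"A <em>title</em>"}\n'
    ":metadata -->\n"
    "<p>This is the body of the document</p>\n"
)
meta, body = split_metadata(sample)
print(meta["authors"])
print(body)
```

Because the metadata arrives as JSON, lists and nested maps come back as ordinary Python lists and dicts with no HTML parsing involved.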

@lierdakil

This could be done with a filter, e.g.

import Text.Pandoc.JSON
import Text.Pandoc
import Data.Aeson (Value, encode, toJSON)
import Data.ByteString.Lazy.UTF8 (toString)
import qualified Data.Map as M

main :: IO ()
main = toJSONFilter inputMeta

inputMeta :: Pandoc -> Pandoc
inputMeta (Pandoc m b) = Pandoc m (mb:b)
  where
    mb = RawBlock (Format "html") $
      "<!-- metadata:\n" ++ toString (encode $ metaToJSON m) ++ "\n-->"

metaToJSON :: Meta -> Value
metaToJSON (Meta m) = toJSON $ M.map metaValueToJSON m

metaValueToJSON :: MetaValue -> Value
metaValueToJSON (MetaMap m) = toJSON $ M.map metaValueToJSON m
metaValueToJSON (MetaList xs) = toJSON $ map metaValueToJSON xs
metaValueToJSON (MetaString t) = toJSON t
metaValueToJSON (MetaBool b) = toJSON b
metaValueToJSON (MetaInlines ils) = toJSON $ toHtml ils
metaValueToJSON (MetaBlocks bs) = toJSON $ toHtml' bs

toHtml :: [Inline] -> String
toHtml ils = html
  where
    html = writeHtmlString options $ Pandoc nullMeta [Plain ils]

toHtml' :: [Block] -> String
toHtml' bs = writeHtmlString options $ Pandoc nullMeta bs

options :: WriterOptions
options = def{writerHTMLMathMethod=MathML Nothing}

I don't think this makes a ton of sense as core functionality. I would, however, appreciate a built-in metaToJSON/metaValueToJSON, as well as a way to translate from Inlines to String for a given format without weird prefix-stripping. The latter doesn't make sense for all formats, though. UPD: Silly me, there is the Plain block-level element for that.

@lierdakil

Note that whatever you want this for, you are probably better off just writing a filter for it. You can choose between Haskell, Python, or in fact any language that can handle JSON input and output (e.g. NodeJS). Haskell and Python are the supported options, though. You might want to look at http://johnmacfarlane.net/pandoc/scripting.html

@zackw

I have experimented with this approach, and I think I can make it work, but it is suboptimal.

For context, I am trying to improve the metadata handling in liob/pandoc_reader, which uses Pandoc as the front end for a static site generator, Pelican; Pelican is written in Python. In this context, I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.

Now, if I'm writing a filter in other-than-Haskell, I can't get at writeHtmlString, so the best I can do is some kind of AST-to-AST transformation that embeds the metadata in the HTML output, preserving its structure. For instance, I can translate MetaList to BulletList, and MetaMap and the top-level metadata object to DefinitionList, and wrap value types in Plain. Let me give a concrete example of the complex metadata I'm working with, and the output of the transformation I have written:

---
authors:
  - Li, Ninghui
  - Li, Tiancheng
  - Venkatasubramanian, S.
title: "$t$-Closeness: Privacy Beyond $k$-Anonymity and $l$-Diversity"
booktitle:
  shortname: ICDE 2007
  fullname: IEEE 23rd International Conference on Data Engineering, 2007
  url: http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
year: 2007
month: April
pages: 106--115
doi: 10.1109/ICDE.2007.367856
tags: [data privacy, database theory, attribute disclosure,
       $k$-anonymity, $l$-diversity, $t$-closeness]
...

body of document

becomes

<dl>
<dt>authors</dt>
<dd><ul>
<li>Li, Ninghui</li>
<li>Li, Tiancheng</li>
<li>Venkatasubramanian, S.</li>
</ul>
</dd>
<dt>booktitle</dt>
<dd><dl>
<dt>fullname</dt>
<dd>IEEE 23rd International Conference on Data Engineering, 2007
</dd>
<dt>shortname</dt>
<dd>ICDE 2007
</dd>
<dt>url</dt>
<dd>http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
</dd>
</dl>
</dd>
<dt>doi</dt>
<dd>10.1109/ICDE.2007.367856
</dd>
<dt>month</dt>
<dd>April
</dd>
<dt>pages</dt>
<dd>106--115
</dd>
<dt>tags</dt>
<dd><ul>
<li>data privacy</li>
<li>database theory</li>
<li>attribute disclosure</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-anonymity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-diversity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-closeness</li>
</ul>
</dd>
<dt>title</dt>
<dd><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-Diversity
</dd>
<dt>year</dt>
<dd>2007
</dd>
</dl>
<hr />
<p>body of document</p>

which I would then split at the <hr /> (incidentally, why does -t html5 emit XMLisms?) and parse the top half back into a data structure. This is less than ideal for two reasons. First, parsing HTML is significantly more complicated than parsing the JSON I originally requested. There is no way to generate JSON with this approach, because there is no way to direct Pandoc to render the contents of an (ex-) MetaInlines or MetaBlocks as HTML and then quote it for JSON. Second, and closely related, there's no definite break point between the HTML defining the data structure and the HTML of each value. I may be able to patch around that with <span> or something, but it'll never be better than awkward.
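
The AST-to-AST transformation described above can be sketched in Python roughly as follows. This is a reconstruction for illustration, not the actual filter: the function names are hypothetical, the document layout `[{"unMeta": ...}, [blocks]]` is the one used by the pandoc version in this discussion, and the node shapes for `DefinitionList` and `BulletList` are assumed to follow the AST of that era.

```python
def plain(inlines):
    """Wrap a list of Inline nodes in a Plain block."""
    return {"t": "Plain", "c": inlines}


def meta_to_blocks(value):
    """Translate one MetaValue node into a list of Block nodes."""
    tag, content = value["t"], value["c"]
    if tag == "MetaMap":
        # Each item is [term inlines, [definition]], one definition per key.
        items = [[[{"t": "Str", "c": key}], [meta_to_blocks(content[key])]]
                 for key in sorted(content)]
        return [{"t": "DefinitionList", "c": items}]
    if tag == "MetaList":
        return [{"t": "BulletList", "c": [meta_to_blocks(v) for v in content]}]
    if tag == "MetaInlines":
        return [plain(content)]
    if tag == "MetaBlocks":
        return content
    if tag == "MetaString":
        return [plain([{"t": "Str", "c": content}])]
    if tag == "MetaBool":
        return [plain([{"t": "Str", "c": "true" if content else "false"}])]
    raise ValueError("unknown metadata node: %r" % tag)


def embed_metadata(doc):
    """Prepend the metadata, rendered as blocks, to the document body."""
    meta, blocks = doc
    top = {"t": "MetaMap", "c": meta["unMeta"]}
    rule = {"t": "HorizontalRule", "c": []}
    return [meta, meta_to_blocks(top) + [rule] + blocks]


# Hand-written sample input, much smaller than the real metadata above.
doc = [
    {"unMeta": {
        "year": {"t": "MetaString", "c": "2007"},
        "tags": {"t": "MetaList", "c": [
            {"t": "MetaInlines", "c": [{"t": "Str", "c": "privacy"}]},
        ]},
    }},
    [{"t": "Para", "c": [{"t": "Str", "c": "body"}]}],
]
result = embed_metadata(doc)
print([block["t"] for block in result[1]])
```

Used as a real filter, the script would read the document from standard input with `json.load`, apply `embed_metadata`, and write the result to standard output with `json.dump`.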

@zackw

Thinking out loud, a potential fix is (1) a new AST node type that means "render what's under this node in output format X and then quote it as a string literal for the surrounding context", (2) some way of generating a custom JSON tree (rather than a literal serialization of the AST). (1) might also be useful for, like, embedding examples of the rendered output in format X in a document of format Y.

The thing I originally asked for seems simpler overall, and easier to implement, though.

@mpickering
Collaborator

Can you not write a filter which just dumps the JSON to a file? If you then really want them in the same file you can then just cat the metadata dump and the output pandoc produces together.
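
A pass-through filter of this kind might look like the sketch below. It assumes the document layout `[{"unMeta": ...}, [blocks]]` used by the pandoc version in this discussion (later versions changed the layout), and `dump_meta` is a hypothetical name. Note that what lands in the side file is the metadata AST, not rendered strings.

```python
import io
import json


def dump_meta(instream, outstream, metastream):
    """Copy a filter document through unchanged, saving its metadata."""
    doc = json.load(instream)
    # Layout assumed from this discussion: [{"unMeta": {...}}, [blocks]].
    json.dump(doc[0]["unMeta"], metastream)
    json.dump(doc, outstream)


# Demonstration with in-memory streams instead of stdin, stdout and a file.
raw = json.dumps([{"unMeta": {"title": {"t": "MetaString", "c": "Hi"}}}, []])
out, meta = io.StringIO(), io.StringIO()
dump_meta(io.StringIO(raw), out, meta)
print(meta.getvalue())
```

In a real filter the three streams would be `sys.stdin`, `sys.stdout`, and an open file whose path the calling program chooses.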

@zackw

@mpickering The JSON structure passed to the filter is

[{ "unMeta": {
    "title": {"t":"MetaInlines","c":[
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"t"]},
        {"t":"Str","c":"-Closeness:"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"Privacy"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"Beyond"},
        {"t":"Space","c":[]},
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"k"]},
        {"t":"Str","c":"-Anonymity"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"and"},
        {"t":"Space","c":[]},
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"l"]},
        {"t":"Str","c":"-Diversity"}
    ]},
    "authors": {"t":"MetaList","c":[
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Li,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"Ninghui"}
        ]},
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Li,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"Tiancheng"}
        ]},
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Venkatasubramanian,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"S."}]}
    ]}
    // ...
}},
[/*body of document here */]]

The JSON structure I want is

{
    "title": "<math display=\"inline\"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display=\"inline\"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display=\"inline\"><mrow><mi>l</mi></mrow></math>-Diversity",
    "authors": [
        "Li, Ninghui",
        "Li, Tiancheng",
        "Venkatasubramanian, S."
    ],
    // ...
}

The only way to get to B from A is to pass back through Pandoc's HTML generator.
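
To make the gap concrete, here is a sketch of what a filter can do on its own: flatten the AST to plain strings. The names `stringify` and `simplify` are hypothetical, and only a few inline types are handled. The structure of B is recovered, but math stays as raw TeX because only pandoc's writer can turn it into MathML.

```python
import json


def stringify(inlines):
    """Flatten a list of Inline nodes to plain text, dropping markup."""
    parts = []
    for node in inlines:
        tag, content = node["t"], node["c"]
        if tag == "Str":
            parts.append(content)
        elif tag == "Space":
            parts.append(" ")
        elif tag == "Math":
            # content is [math type, TeX source]; keep the raw TeX.
            parts.append("$" + content[1] + "$")
        elif tag in ("Emph", "Strong"):
            parts.append(stringify(content))
        # Other inline types are ignored in this sketch.
    return "".join(parts)


def simplify(value):
    """Turn a MetaValue node into plain Python data."""
    tag, content = value["t"], value["c"]
    if tag == "MetaMap":
        return {key: simplify(item) for key, item in content.items()}
    if tag == "MetaList":
        return [simplify(item) for item in content]
    if tag == "MetaInlines":
        return stringify(content)
    return content  # MetaString and MetaBool; MetaBlocks stays as AST


meta = {
    "title": {"t": "MetaInlines", "c": [
        {"t": "Math", "c": [{"t": "InlineMath", "c": []}, "t"]},
        {"t": "Str", "c": "-Closeness:"},
        {"t": "Space", "c": []},
        {"t": "Str", "c": "Privacy"},
    ]},
    "authors": {"t": "MetaList", "c": [
        {"t": "MetaInlines", "c": [
            {"t": "Str", "c": "Li,"},
            {"t": "Space", "c": []},
            {"t": "Str", "c": "Ninghui"},
        ]},
    ]},
}
simple = {key: simplify(value) for key, value in meta.items()}
print(json.dumps(simple, indent=2))
```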

@jgm
Owner
@zackw

> if you know the format of the structure ahead of time, could you just write a custom template?

I don't know the structure ahead of time; it appears that there is no way to iterate over all available variables, to discriminate variables by origin, or to recursively walk an unknown tree structure.

Also, it appears that there is no way to request any sort of syntactic quotation.

@lierdakil

> I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.

Look, in the vast majority of cases, if pandoc is installed, so is the Haskell runtime. That means that, at the very least, you can run Haskell filters through pandoc itself. It is suboptimal in terms of speed, but since it's not being used for dynamic content generation, that shouldn't be a big concern.

@jgm
Owner

The following change to the HTML writer would add a meta-json template variable containing a JSON version of the formatted metadata:

diff --git a/src/Text/Pandoc/Writers/HTML.hs b/src/Text/Pandoc/Writers/HTML.hs
index 53dc931..93834c1 100644
--- a/src/Text/Pandoc/Writers/HTML.hs
+++ b/src/Text/Pandoc/Writers/HTML.hs
@@ -43,6 +43,8 @@ import Text.Pandoc.XML (fromEntities, escapeStringForXML)
 import Network.URI ( parseURIReference, URI(..), unEscapeString )
 import Network.HTTP ( urlEncode )
 import Numeric ( showHex )
+import qualified Data.Aeson as Aeson
+import Text.Pandoc.UTF8 (toStringLazy)
 import Data.Char ( ord, toLower )
 import Data.List ( isPrefixOf, intersperse )
 import Data.String ( fromString )
@@ -194,6 +196,7 @@ pandocToHtml opts (Pandoc meta blocks) = do
                   defField "revealjs-url" ("reveal.js" :: String) $
                   defField "s5-url" ("s5/default" :: String) $
                   defField "html5" (writerHtml5 opts) $
+                  defField "meta-json" (toStringLazy $ Aeson.encode metadata) $
                   metadata
   return (thebody, context)

This could be used with a custom template like

<!--
$meta-json$
-->
$body$

to get what @zackw is looking for.

So, one possible change to pandoc would be to define a meta-json variable in all writers. Rather than changing all the writers one by one, it would make sense to modify the metaToJSON function. I can see how this would make it easier to integrate pandoc with other things, like static site generators. What do people think?

@jgm
Owner

Better, more general, patch, affecting all writers:

diff --git a/src/Text/Pandoc/Writers/Shared.hs b/src/Text/Pandoc/Writers/Shared.hs
index 800e741..cc9e59d 100644
--- a/src/Text/Pandoc/Writers/Shared.hs
+++ b/src/Text/Pandoc/Writers/Shared.hs
@@ -45,7 +45,8 @@ import Text.Pandoc.Options (WriterOptions(..))
 import qualified Data.HashMap.Strict as H
 import qualified Data.Map as M
 import qualified Data.Text as T
-import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..))
+import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..), encode)
+import Text.Pandoc.UTF8 (toStringLazy)
 import qualified Data.Traversable as Traversable
 import Data.List ( groupBy )

@@ -67,7 +68,8 @@ metaToJSON opts blockWriter inlineWriter (Meta metamap)
     renderedMap <- Traversable.mapM
                    (metaValueToJSON blockWriter inlineWriter)
                    metamap
-    return $ M.foldWithKey defField baseContext renderedMap
+    let metadata = M.foldWithKey defField baseContext renderedMap
+    return $ defField "meta-json" (toStringLazy $ encode metadata) metadata
   | otherwise = return (Object H.empty)

 metaValueToJSON :: Monad m

@zackw

I like this as long as it does the Right Thing with complicated quoting cases like

---
title: "`<!-- HTML Comments And You -->`: An \"Informal\" Discussion"
author: Alice & Bob
...

This is only what I could think of off the top of my head; I'm sure there are nastier constructs.
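
The two quoting layers at stake can be illustrated in plain Python. This is an approximation and not pandoc's actual output: `html.escape` merely stands in for the HTML writer, and the code-span and smart-quote handling a real writer would apply are left out.

```python
import html
import json

title = '<!-- HTML Comments And You -->: An "Informal" Discussion'
author = "Alice & Bob"

# Layer 1: escape for HTML, standing in for what an HTML writer does to text.
rendered = {
    "title": html.escape(title, quote=False),
    "author": html.escape(author, quote=False),
}

# Layer 2: serialize as JSON, which backslash-escapes the double quotes.
payload = json.dumps(rendered)

# Embedding in an HTML comment is only safe if the payload has no "-->".
wrapped = "<!--\n" + payload + "\n-->"
print(wrapped)

# Undo both layers to recover the original strings.
decoded = {key: html.unescape(value) for key, value in json.loads(payload).items()}
print(decoded["title"])
```

With both layers applied in this order, the only `-->` in the result is the one closing the wrapper. The payload does still contain `--`, which strict XML parsers reject inside comments, so a consumer that needs XML well-formedness would want a different wrapper.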

@jgm
Owner
@zackw zackw referenced this issue in liob/pandoc_reader
Closed

Add support for parsing YAML metadata. #5

@chriskrycho

Adding a :+1: here, because I have very similar needs to those outlined by @zackw, and he and I ended up independently working around the issue elsewhere (see liob/pandoc_reader#3, liob/pandoc_reader#4, and liob/pandoc_reader#5). I also note that integration with other possible static site generators could be a big win, since pandoc is in my experience meaningfully faster than many other implementations. (E.g. it moves at least twice as fast as the standard Python Markdown implementation—I just compared the two on a ~16k-line test file with as close to the same settings for parsing as possible, and it runs in half the time. For a file 10⨉ that size… well, Python Markdown just falls down; it never finished. :stuck_out_tongue:)

@bpj
@chriskrycho

@bpj That's true in a general sense, but it doesn't get at the issue here, and it certainly doesn't give you the data back in a format (e.g. JSON) readily transformed or handed around within another application, which is the context which drove @zackw's request (and is my interest as well): both of us are using pandoc to drive Pelican, and are doing a bit of a dance to handle YAML metadata in that context.

@jgm jgm added a commit that closed this issue
@jgm Define a `meta-json` variable for all writers.
This contains a JSON version of all the metadata, in the
format selected for the writer.

So, for example, to get just the YAML metadata, you can
run pandoc with the following custom template:

    $meta-json$

Closes #2019.  The intent is to make it easier for static
site generators and other tools to get at the metadata.
4361dc0
@jgm jgm closed this in 4361dc0
@jgm
Owner

I've added meta-json. So now a template with just $meta-json$ will give you the document's metadata in JSON format (formatted according to the writer).
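
From a Python caller such as a static site generator plugin, this could be driven along the following lines. It is a sketch under assumptions: a pandoc build that includes the meta-json change is on the PATH, the helper names and the template file name are hypothetical, and `-t html5 --mathml` is just one possible choice of writer options.

```python
import json
import os
import subprocess
import tempfile


def write_template(directory):
    """Write a template that contains only the meta-json variable."""
    path = os.path.join(directory, "metadata.template")  # hypothetical name
    with open(path, "w") as handle:
        handle.write("$meta-json$\n")
    return path


def build_command(source, template, writer="html5"):
    """Assemble the pandoc command line for a metadata-only run."""
    return ["pandoc", "--template=" + template, "-t", writer, "--mathml", source]


def read_metadata(source):
    """Return the metadata of `source`, rendered for the chosen writer."""
    with tempfile.TemporaryDirectory() as directory:
        command = build_command(source, write_template(directory))
        output = subprocess.check_output(command)
    return json.loads(output.decode("utf-8"))
```

Calling `read_metadata("post.md")` on a file with a YAML metadata block should then return a dictionary whose string values are already rendered as HTML; a second pandoc run with the normal template produces the body.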
