dump metadata along with document fragment #2019
This could be done with a filter, e.g.
import Text.Pandoc.JSON
import Text.Pandoc
import Data.Aeson.Encode
import Data.Aeson.Types
import Data.ByteString.Lazy.UTF8
import Data.List
import qualified Data.Map as M
main :: IO ()
main = toJSONFilter inputMeta
inputMeta :: Pandoc -> Pandoc
inputMeta (Pandoc m b) = Pandoc m (mb:b)
  where
    mb = RawBlock (Format "html") $
      "<!-- metadata:\n" ++ toString (encode $ metaToJSON m) ++ "\n-->"
metaToJSON :: Meta -> Value
metaToJSON (Meta m) = toJSON $ M.map metaValueToJSON m
metaValueToJSON :: MetaValue -> Value
metaValueToJSON (MetaMap m) = toJSON $ M.map metaValueToJSON m
metaValueToJSON (MetaList xs) = toJSON $ map metaValueToJSON xs
metaValueToJSON (MetaString t) = toJSON t
metaValueToJSON (MetaBool b) = toJSON b
metaValueToJSON (MetaInlines ils) = toJSON $ toHtml ils
metaValueToJSON (MetaBlocks bs) = toJSON $ toHtml' bs
toHtml :: [Inline] -> String
toHtml ils = html
  where
    html = writeHtmlString options $ Pandoc nullMeta [Plain ils]
toHtml' :: [Block] -> String
toHtml' bs = writeHtmlString options $ Pandoc nullMeta bs
options :: WriterOptions
options = def{writerHTMLMathMethod=MathML Nothing}
I don't think this makes a ton of sense as core functionality. I would, however, appreciate a built-in metaToJSON/metaValueToJSON, as well as methods to translate from Inlines to String for a given format without weird prefix-stripping. The latter doesn't make sense for all formats, though.
UPD: Silly me, there is the Plain block-level element for that.
Note that whatever you want this for, you are probably better off just writing a filter for it. You can choose between Haskell, Python, or in fact any language that can handle JSON input and output (e.g. NodeJS). Haskell and Python are supported, though. You might want to look at http://johnmacfarlane.net/pandoc/scripting.html
I have experimented with this approach, and I think I can make it work, but it is suboptimal.
For context, I am trying to improve the metadata handling in liob/pandoc_reader, which uses Pandoc as the front end for a static site generator, Pelican; Pelican is written in Python. In this context, I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.
Now, if I'm writing a filter in other-than-Haskell, I can't get at writeHtmlString, so the best I can do is some kind of AST-to-AST transformation that embeds the metadata in the HTML output, preserving its structure. For instance, I can translate MetaList to BulletList, and MetaMap and the top-level metadata object to DefinitionList, and wrap value types in Plain. Let me give a concrete example of the complex metadata I'm working with, and the output of the transformation I have written:
---
authors:
- Li, Ninghui
- Li, Tiancheng
- Venkatasubramanian, S.
title: "$t$-Closeness: Privacy Beyond $k$-Anonymity and $l$-Diversity"
booktitle:
  shortname: ICDE 2007
  fullname: IEEE 23rd International Conference on Data Engineering, 2007
  url: http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
year: 2007
month: April
pages: 106--115
doi: 10.1109/ICDE.2007.367856
tags: [data privacy, database theory, attribute disclosure,
       $k$-anonymity, $l$-diversity, $t$-closeness]
...
body of document
becomes
<dl>
<dt>authors</dt>
<dd><ul>
<li>Li, Ninghui</li>
<li>Li, Tiancheng</li>
<li>Venkatasubramanian, S.</li>
</ul>
</dd>
<dt>booktitle</dt>
<dd><dl>
<dt>fullname</dt>
<dd>IEEE 23rd International Conference on Data Engineering, 2007
</dd>
<dt>shortname</dt>
<dd>ICDE 2007
</dd>
<dt>url</dt>
<dd>http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
</dd>
</dl>
</dd>
<dt>doi</dt>
<dd>10.1109/ICDE.2007.367856
</dd>
<dt>month</dt>
<dd>April
</dd>
<dt>pages</dt>
<dd>106--115
</dd>
<dt>tags</dt>
<dd><ul>
<li>data privacy</li>
<li>database theory</li>
<li>attribute disclosure</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-anonymity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-diversity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-closeness</li>
</ul>
</dd>
<dt>title</dt>
<dd><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-Diversity
</dd>
<dt>year</dt>
<dd>2007
</dd>
</dl>
<hr />
<p>body of document</p>
which I would then split at the <hr /> (incidentally, why does -t html5 emit XMLisms?) and parse the top half back into a data structure. This is less than ideal for two reasons. First, parsing HTML is significantly more complicated than parsing JSON as originally requested. There is no way to generate JSON with this approach, because there is no way to direct Pandoc to render the contents of an (ex-) MetaInlines or MetaBlocks as HTML and then quote it for JSON. Second, and closely related, there's no definite break point between the HTML defining the data structure and the HTML of each value. I may be able to patch around that with <span> or something, but it'll never be better than awkward.
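For reference, the AST-to-AST transformation described above can be sketched in Python roughly as follows. This is an illustration, not the code actually used in pandoc_reader: the function names are made up, and it assumes the filter input has the [{"unMeta": ...}, [blocks]] shape shown elsewhere in this thread (other pandoc versions may lay the document out differently).

```python
import json


def node(tag, content):
    """Build an AST node in the {"t": ..., "c": ...} shape used by the filter JSON."""
    return {"t": tag, "c": content}


def meta_value_to_blocks(value):
    """Translate one metadata value into a list of block nodes."""
    tag, content = value["t"], value["c"]
    if tag == "MetaList":
        # Each bullet item is itself a list of blocks.
        return [node("BulletList", [meta_value_to_blocks(v) for v in content])]
    if tag == "MetaMap":
        return [meta_map_to_deflist(content)]
    if tag == "MetaInlines":
        return [node("Plain", content)]
    if tag == "MetaBlocks":
        return content
    if tag == "MetaBool":
        return [node("Plain", [node("Str", "true" if content else "false")])]
    # Remaining case: MetaString, whose content is a bare string.
    return [node("Plain", [node("Str", content)])]


def meta_map_to_deflist(mapping):
    """Turn a key -> metadata value mapping into a DefinitionList node."""
    # Each item pairs the term (inlines) with a list of definitions,
    # and each definition is a list of blocks.
    items = [
        [[node("Str", key)], [meta_value_to_blocks(mapping[key])]]
        for key in sorted(mapping)
    ]
    return node("DefinitionList", items)


def embed_meta(doc):
    """Prepend the metadata, as a definition list plus a rule, to the body."""
    meta, blocks = doc
    prefix = [meta_map_to_deflist(meta["unMeta"]), node("HorizontalRule", [])]
    return [meta, prefix + blocks]


# Small demonstration on a hand-written document.
sample = [
    {"unMeta": {"year": {"t": "MetaString", "c": "2007"}}},
    [node("Para", [node("Str", "body")])],
]
print(json.dumps(embed_meta(sample)))
```

Used as a real filter, the script would read the document with json.load(sys.stdin) and write the result of embed_meta to sys.stdout instead of printing the sample.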
Thinking out loud, a potential fix is (1) a new AST node type that means "render what's under this node in output format X and then quote it as a string literal for the surrounding context", (2) some way of generating a custom JSON tree (rather than a literal serialization of the AST). (1) might also be useful for, like, embedding examples of the rendered output in format X in a document of format Y.
The thing I originally asked for seems simpler overall, and easier to implement, though.
Can you not write a filter which just dumps the JSON to a file? If you then really want them in the same file you can then just cat the metadata dump and the output pandoc produces together.
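A minimal sketch of such a filter in Python, assuming the filter input is the two-element [{"unMeta": ...}, [blocks]] array used by the pandoc version discussed in this thread. The function names and the meta.json file name are made up for illustration.

```python
import json
import sys


def dump_meta(doc, path):
    """Write the raw metadata AST to `path` and return the document unchanged."""
    meta = doc[0]["unMeta"]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(meta, handle)
    return doc


def main():
    """Filter entry point: JSON document in on stdin, same document out on stdout."""
    doc = json.load(sys.stdin)
    json.dump(dump_meta(doc, "meta.json"), sys.stdout)
```

A real filter script would end by calling main(). Note that this dumps the metadata as an unrendered AST, not as formatted strings.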
@mpickering The JSON structure passed to the filter is
[{ "unMeta": {
"title": {"t":"MetaInlines","c":[
{"t":"Math","c":[{"t":"InlineMath","c":[]},"t"]},
{"t":"Str","c":"-Closeness:"},
{"t":"Space","c":[]},
{"t":"Str","c":"Privacy"},
{"t":"Space","c":[]},
{"t":"Str","c":"Beyond"},
{"t":"Space","c":[]},
{"t":"Math","c":[{"t":"InlineMath","c":[]},"k"]},
{"t":"Str","c":"-Anonymity"},
{"t":"Space","c":[]},
{"t":"Str","c":"and"},
{"t":"Space","c":[]},
{"t":"Math","c":[{"t":"InlineMath","c":[]},"l"]},
{"t":"Str","c":"-Diversity"}
]},
"authors": {"t":"MetaList","c":[
{"t":"MetaInlines","c":[
{"t":"Str","c":"Li,"},
{"t":"Space","c":[]},
{"t":"Str","c":"Ninghui"}
]},
{"t":"MetaInlines","c":[
{"t":"Str","c":"Li,"},
{"t":"Space","c":[]},
{"t":"Str","c":"Tiancheng"}
]},
{"t":"MetaInlines","c":[
{"t":"Str","c":"Venkatasubramanian,"},
{"t":"Space","c":[]},
{"t":"Str","c":"S."}]}
]}
// ...
}},
[/*body of document here */]]
The JSON structure I want is
{
"title": "<math display=\"inline\"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display=\"inline\"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display=\"inline\"><mrow><mi>l</mi></mrow></math>-Diversity",
"authors": [
"Li, Ninghui",
"Li, Tiancheng",
"Venkatasubramanian, S."
],
// ...
}
The only way to get to B from A is to pass back through Pandoc's HTML generator.
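To illustrate the gap: a filter written outside Haskell can flatten the structure easily enough, but it can only approximate the rendering of each value. The sketch below (hypothetical function names, same assumed [{"unMeta": ...}, [blocks]] input shape) gets the authors list right, but leaves math as raw TeX source instead of MathML and silently drops any inline type it does not know about.

```python
def inlines_to_text(inlines):
    """Crude plain-text rendering of a list of inline nodes."""
    parts = []
    for inline in inlines:
        tag, content = inline["t"], inline["c"]
        if tag == "Str":
            parts.append(content)
        elif tag == "Space":
            parts.append(" ")
        elif tag == "Math":
            # content is [math type, TeX source]; the TeX is kept verbatim.
            parts.append("$" + content[1] + "$")
        # Other inline types (Emph, Link, ...) are ignored in this sketch.
    return "".join(parts)


def flatten_meta(value):
    """Convert a tagged metadata value into plain Python data."""
    tag, content = value["t"], value["c"]
    if tag == "MetaMap":
        return {key: flatten_meta(val) for key, val in content.items()}
    if tag == "MetaList":
        return [flatten_meta(val) for val in content]
    if tag == "MetaInlines":
        return inlines_to_text(content)
    # MetaString and MetaBool are already plain; MetaBlocks stays as raw AST.
    return content


def flatten_document_meta(doc):
    """Flatten the top-level metadata object of a filter document."""
    return {key: flatten_meta(val) for key, val in doc[0]["unMeta"].items()}
```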
If you know the format of the structure ahead of time, could you just write a custom template?
I don't know the structure ahead of time; it appears that there is no way to iterate over all available variables, nor discriminate variables by origin, nor to recursively walk an unknown tree structure.
Also, it appears that there is no way to request any sort of syntactic quotation.
I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.
Look, in the absolute majority of cases, if pandoc is installed, so is the Haskell runtime. That means that, at the very least, you can run Haskell filters through pandoc itself. It is suboptimal in terms of speed, but since this is not used for dynamic content generation of some sort, that shouldn't be a big concern.
The following change to the HTML writer would add a meta-json template variable containing a JSON version of the formatted metadata:
diff --git a/src/Text/Pandoc/Writers/HTML.hs b/src/Text/Pandoc/Writers/HTML.hs
index 53dc931..93834c1 100644
--- a/src/Text/Pandoc/Writers/HTML.hs
+++ b/src/Text/Pandoc/Writers/HTML.hs
@@ -43,6 +43,8 @@ import Text.Pandoc.XML (fromEntities, escapeStringForXML)
import Network.URI ( parseURIReference, URI(..), unEscapeString )
import Network.HTTP ( urlEncode )
import Numeric ( showHex )
+import qualified Data.Aeson as Aeson
+import Text.Pandoc.UTF8 (toStringLazy)
import Data.Char ( ord, toLower )
import Data.List ( isPrefixOf, intersperse )
import Data.String ( fromString )
@@ -194,6 +196,7 @@ pandocToHtml opts (Pandoc meta blocks) = do
defField "revealjs-url" ("reveal.js" :: String) $
defField "s5-url" ("s5/default" :: String) $
defField "html5" (writerHtml5 opts) $
+ defField "meta-json" (toStringLazy $ Aeson.encode metadata) $
metadata
return (thebody, context)
This could be used with a custom template like
<!--
$meta-json$
-->
$body$
to get what @zackw is looking for.
So, one possible change to pandoc would be to define a meta-json variable in all writers. Rather than changing all the writers one by one, it would make sense to modify the metaToJSON function. I can see how this would make it easier to integrate pandoc with other things, like static site generators. What do people think?
Better, more general, patch, affecting all writers:
diff --git a/src/Text/Pandoc/Writers/Shared.hs b/src/Text/Pandoc/Writers/Shared.hs
index 800e741..cc9e59d 100644
--- a/src/Text/Pandoc/Writers/Shared.hs
+++ b/src/Text/Pandoc/Writers/Shared.hs
@@ -45,7 +45,8 @@ import Text.Pandoc.Options (WriterOptions(..))
import qualified Data.HashMap.Strict as H
import qualified Data.Map as M
import qualified Data.Text as T
-import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..))
+import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..), encode)
+import Text.Pandoc.UTF8 (toStringLazy)
import qualified Data.Traversable as Traversable
import Data.List ( groupBy )
@@ -67,7 +68,8 @@ metaToJSON opts blockWriter inlineWriter (Meta metamap)
renderedMap <- Traversable.mapM
(metaValueToJSON blockWriter inlineWriter)
metamap
- return $ M.foldWithKey defField baseContext renderedMap
+ let metadata = M.foldWithKey defField baseContext renderedMap
+ return $ defField "meta-json" (toStringLazy $ encode metadata) metadata
| otherwise = return (Object H.empty)
metaValueToJSON :: Monad m
I like this as long as it does the Right Thing with complicated quoting cases like
---
title: "`<!-- HTML Comments And You -->`: An \"Informal\" Discussion"
author: Alice & Bob
...
this being only what I could think of off the top of my head, I'm sure there are nastier constructs.
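To make the concern concrete, here is a small Python sketch using the standard json module as a stand-in for pandoc's encoder. The metadata values are taken as plain strings, so pandoc's own output may well differ, since the writer formats each value before it is encoded. JSON escaping takes care of the double quotes, but does nothing about a literal --> in a value, which matters if the JSON is wrapped in an HTML comment and the consumer naively cuts at the first comment terminator. The function names are made up for illustration.

```python
import json

# The example metadata as plain strings (no pandoc rendering applied).
meta = {
    "title": '<!-- HTML Comments And You -->: An "Informal" Discussion',
    "author": "Alice & Bob",
}

# Double quotes are backslash-escaped, so the JSON itself round-trips.
encoded = json.dumps(meta)


def wrap_in_comment(payload):
    """Embed a payload in an HTML comment, as a template might."""
    return "<!--\n" + payload + "\n-->"


def naive_extract(comment):
    """Take everything between the opening marker and the first '-->'."""
    start = len("<!--\n")
    end = comment.index("-->")
    return comment[start:end].strip()


# The naive extraction stops inside the title string, so what it
# returns is a truncated fragment rather than the full JSON payload.
extracted = naive_extract(wrap_in_comment(encoded))
print(extracted == encoded)
```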
Adding a +1 here, because I have very similar needs to those outlined by @zackw, and he and I ended up independently working around the issue elsewhere (see liob/pandoc_reader#3, liob/pandoc_reader#4, and liob/pandoc_reader#5). I also note that integration with other possible static site generators could be a big win, since pandoc is in my experience meaningfully faster than many other implementations. (E.g. it moves at least twice as fast as the standard Python Markdown implementation: I just compared the two on a ~16k-line test file with as close to the same settings for parsing as possible, and it runs in half the time. For a file 10⨉ that size… well, Python Markdown just falls down; it never finished.)
@bpj That's true in a general sense, but it doesn't get at the issue here, and it certainly doesn't give you the data back in a format (e.g. JSON) readily transformed or handed around within another application, which is the context which drove @zackw's request (and is my interest as well): both of us are using pandoc to drive Pelican, and are doing a bit of a dance to handle YAML metadata in that context.
I've added meta-json. So now a template with just $meta-json$ will give you the document's metadata in JSON format (formatted according to the writer).
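On the consuming side, a Python caller could then pick the metadata apart from the body along these lines. This is a sketch under two assumptions that are not confirmed in this thread: that the custom template puts $meta-json$ alone on its first line with $body$ after it, and that the encoded JSON contains no literal newlines. The function name is made up.

```python
import json


def split_meta_and_body(output):
    """Split pandoc output into (metadata dict, body string).

    Assumes the first line is the JSON metadata and everything after
    the first newline is the rendered body.
    """
    first_line, _, body = output.partition("\n")
    return json.loads(first_line), body
```

The values in the resulting dictionary are already rendered for the chosen writer, so for HTML output a title containing math would arrive as a string of HTML markup.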
I am looking for a way to get pandoc to dump out metadata along with a document fragment. This should behave as follows:
An example would probably help: given
pandoc -t html5+metadata --mathml
should produce something like
It's quite possible that there's already a way to do something like this and I just can't find it, in which case I would appreciate a pointer.
A way to dump only the metadata, but still applying a rendering, would also be useful.