Monday, July 19, 2010

Changes to request body and RqData in head

I have just pushed some patches which affect the way the Request body and RqData are handled in Happstack 0.6. These patches contain user-visible changes which will affect you if you:



  • Use RqData

  • Directly use the rqBody field in Request

  • Directly use the rqInput field in Request

  • Directly work with the Input type

  • Allow file uploads


Some of the changes fix bugs (design flaws), and others add new features and functionality. The incompatible API changes are small, so it should be easy to port your code. It basically comes down to:



  1. getDataFn, withDataFn, etc take an extra argument of the type BodyPolicy

  2. getDataFn, withDataFn, etc return Either [String] a instead of Maybe a

  3. the inputValue field of the Input type is now Either FilePath L.ByteString instead of L.ByteString

  4. you have to explicitly import the module Happstack.Server.RqData
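
For item 2, most of the porting work is at the call sites that inspect the result. Here is a minimal sketch of that change, using plain Haskell values as stand-ins rather than real Happstack calls. The names describeOld and describeNew are hypothetical and exist only for illustration.

```haskell
import Data.List (intercalate)

-- Stand-ins only: in real code these values would come from getDataFn.

-- Happstack 0.5 style: a failure carries no explanation.
describeOld :: Maybe Int -> String
describeOld Nothing  = "Invalid Request"
describeOld (Just n) = "got " ++ show n

-- Happstack 0.6 style: a failure carries a list of error messages.
describeNew :: Either [String] Int -> String
describeNew (Left errs) = "Invalid Request: " ++ intercalate "; " errs
describeNew (Right n)   = "got " ++ show n
```

Code that previously matched on Nothing now matches on Left and can choose to show or log the messages.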


In this post I will describe what motivated these changes. I am also hoping to get feedback on these changes before we release 0.6, since it will be less painful to make further changes now.


The Request body and space usage


In the old code the Request type stores the request body as a simple lazy ByteString:


> newtype RqBody = Body { unBody :: L.ByteString } deriving (Read,Show,Typeable)
>
> data Request = Request { ...
> , rqBody :: RqBody
> }

This feels nice, because it is a simple, pure value. Unfortunately, it is really not a great idea in practice. The request body does not initially require any space, because it is an unevaluated lazy ByteString. But the ServerPart holds the Request in its environment, and that means the garbage collector cannot free the RqBody as you evaluate it. If the request body contained gigabytes of data, that could be disastrous.


The solution in Happstack 0.6 is to use an MVar to hold the request body:


> 
> data Request = Request { ...
> , rqBody :: MVar RqBody
> }

Instead of using rqBody directly, it is better to use takeRequestBody, so that your code will not break if we switch to IORef or something else.


> takeRequestBody :: Request -> IO (Maybe RqBody)
> takeRequestBody rq = tryTakeMVar (rqBody rq)

Now, when you process the RqBody, the Request will not be holding onto it, so the garbage collector can free it (assuming your code does not hold onto it and introduce a new space leak).
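
The take-once behaviour can be seen in a small self-contained sketch. The Request type here is a cut-down stand-in with only the rqBody field, and takeTwice and bodySize are hypothetical helpers for illustration. Only tryTakeMVar from base does the real work.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, tryTakeMVar)
import qualified Data.ByteString.Lazy.Char8 as L

-- Simplified stand-ins; the real Request has many more fields.
newtype RqBody = Body { unBody :: L.ByteString }

data Request = Request { rqBody :: MVar RqBody }

takeRequestBody :: Request -> IO (Maybe RqBody)
takeRequestBody rq = tryTakeMVar (rqBody rq)

-- Size of a body in bytes (hypothetical helper).
bodySize :: RqBody -> Int
bodySize = fromIntegral . L.length . unBody

-- Take the body twice and report the size each attempt saw.
-- The first take empties the MVar, so the second one gets Nothing.
takeTwice :: String -> IO (Maybe Int, Maybe Int)
takeTwice payload = do
  var <- newMVar (Body (L.pack payload))
  let rq = Request var
  first  <- takeRequestBody rq
  second <- takeRequestBody rq
  return (fmap bodySize first, fmap bodySize second)
```

After the first take, the only reference to the body is the one the caller holds, which is what lets it be collected as it is consumed.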


This does have a drawback, however. A ServerPart can call mzero at any time, and processing will move on to the next ServerPart. However, if you have already taken the RqBody, then the next ServerPart may be missing critical data it needs. But if we left the RqBody intact, that would result in the space leak. I think that in practice, if a ServerPart made enough progress that it started consuming the RqBody and then failed, it is unlikely that another ServerPart would succeed and need the RqBody. If another ServerPart succeeds, it is probably just a 404 Not Found handler or something similar, which does not need the request body. So it seems better to have the default behavior be the more space-friendly solution.


We will also provide peekRequestBody and/or putRequestBody functions so that you can opt to leave the request body intact. It is up to you to be sensible about using them.
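
Since these functions are not final, the following is only a guess at their shape. It assumes peekRequestBody reads the MVar without emptying it (via tryReadMVar) and putRequestBody refills an empty MVar (via tryPutMVar). The signatures, the cut-down Request type, and the roundTrip helper are all assumptions made for illustration.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, tryPutMVar, tryReadMVar, tryTakeMVar)
import qualified Data.ByteString.Lazy.Char8 as L

-- Simplified stand-ins; the real Request has many more fields.
newtype RqBody = Body { unBody :: L.ByteString }

data Request = Request { rqBody :: MVar RqBody }

takeRequestBody :: Request -> IO (Maybe RqBody)
takeRequestBody rq = tryTakeMVar (rqBody rq)

-- Assumed shape: look at the body but leave it in the Request.
peekRequestBody :: Request -> IO (Maybe RqBody)
peekRequestBody rq = tryReadMVar (rqBody rq)

-- Assumed shape: put a body back; False means one was already there.
putRequestBody :: Request -> RqBody -> IO Bool
putRequestBody rq = tryPutMVar (rqBody rq)

-- Peek, take, take again, restore, take once more.
roundTrip :: String -> IO ([Maybe String], Bool)
roundTrip payload = do
  rq       <- Request <$> newMVar (Body (L.pack payload))
  peeked   <- peekRequestBody rq   -- body stays in the MVar
  taken    <- takeRequestBody rq   -- body removed
  missing  <- takeRequestBody rq   -- nothing left
  restored <- maybe (return False) (putRequestBody rq) taken
  again    <- takeRequestBody rq   -- available again after the put
  return (map (fmap (L.unpack . unBody)) [peeked, taken, missing, again], restored)
```

Note that peeking or putting the body back keeps it reachable from the Request, so the space leak described above returns for as long as it stays there.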


BodyInput and space usage


In RqData, the cookies, QUERY_STRING, and request body (when appropriate) are parsed into a [(String, Input)], where String is the name of the key, and Input is the value.


In Happstack 0.6, Input will be the type:


> data Input = Input
> { inputValue :: Either FilePath L.ByteString
> , inputFilename :: Maybe FilePath
> , inputContentType :: ContentType
> } deriving (Show,Read,Typeable)

In Happstack 0.5 the inputValue is simply a L.ByteString. Once again, this seems fine at first. After all, the inputValues are lazy ByteStrings, so we can process them lazily, right? Well, not quite. In the unprocessed request body, the key/value pairs are laid out like this:



key1
value1
key2
value2
key3
value3
key4
value4
...

If we were to consume the key/value pairs in a sequential manner, then we would be OK. But generally we want to use functions which can look up a specific key. Imagine we want to look up key4. In order to do that we have to first read in all the preceding key/value pairs. If we knew we only cared about key4 then we could just toss the rest. But with the monadic RqData code we don't know that. (A future post will talk about an arrow-based alternative where we do know that.) So, we have to store all the key/value pairs in case we want to look up key1 after key4.
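
A toy model of the problem, using plain Strings instead of a real request body. The names pairUp and lookupBoth are hypothetical and only illustrate why earlier pairs must be retained.

```haskell
-- Pair up alternating key/value lines:
-- ["k1","v1","k2","v2"] becomes [("k1","v1"),("k2","v2")].
pairUp :: [String] -> [(String, String)]
pairUp (k : v : rest) = (k, v) : pairUp rest
pairUp _              = []

-- Looking up key4 walks past key1..key3 first. Because a later lookup
-- asks for key1, none of the earlier pairs can be thrown away.
lookupBoth :: [String] -> (Maybe String, Maybe String)
lookupBoth ls = (lookup "key4" pairs, lookup "key1" pairs)
  where
    pairs = pairUp ls
```

Both lookups share the same pairs list, so every value read on the way to key4 remains live until the last lookup has run.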


In Happstack 0.5, we store all those values in RAM. But some of those values might be (huge) files. That clearly isn't going to work. So we once again trade off a bit of simplicity/elegance for the practical matter of not having unlimited amounts of RAM. Instead we store some values in RAM and some values on the disk. How do we decide what goes where? That brings us to BodyPolicy.


BodyPolicy


When parsing the request body, we need some way to decide what values should be stored in RAM and what values should be saved to disk. Additionally, we want to impose limits on how much data can be stored in either location. If a user decides to post the contents of /dev/random you are likely to want to cut them off at some point. However, the specific values for the quotas are application specific. In fact, they may be specific to the particular form that is being processed. For example, an admin user might have higher quotas than a regular user.


The answers to these questions are provided by the BodyPolicy, which looks like:


> data BodyPolicy 
> = BodyPolicy { inputWorker :: Int64 -> Int64 -> Int64 -> InputWorker
> , maxDisk :: Int64 -- ^ maximum bytes to save to disk (files)
> , maxRAM :: Int64 -- ^ maximum bytes to hold in RAM
> , maxHeader :: Int64 -- ^ maximum header size (this only affects headers in the multipart/form-data)
> }

The inputWorker is the function that actually decides where values should be saved, and implements the quotas. Its three Int64 arguments are the quotas for the disk, for RAM, and for headers, which are not really saved but can temporarily take up space. The remaining three fields are the values to pass to the inputWorker.
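
The InputWorker type itself is not shown in this post, so the following is not its real interface. It is a simplified, pure model of the kind of decision such a worker makes, under the assumption that parts carrying a filename are file uploads and go to disk while everything else stays in RAM. Quotas, Usage, Placement, Part, and place are all hypothetical names.

```haskell
import Data.Int (Int64)

-- Hypothetical, simplified model; not the real InputWorker type.
data Quotas = Quotas { maxDisk :: Int64, maxRAM :: Int64 }

-- Bytes stored so far in each location.
data Usage = Usage { usedDisk :: Int64, usedRAM :: Int64 }
  deriving (Eq, Show)

data Placement = ToDisk | ToRAM
  deriving (Eq, Show)

-- One part of the body: an optional upload filename and its size in bytes.
data Part = Part { partFilename :: Maybe FilePath, partSize :: Int64 }

-- Assumption: a part with a filename is a file upload and goes to disk;
-- anything else is kept in RAM. Exceeding either quota is an error.
place :: Quotas -> Usage -> Part -> Either String (Placement, Usage)
place q u p = case partFilename p of
  Just _
    | usedDisk u + partSize p > maxDisk q -> Left "disk quota exceeded"
    | otherwise -> Right (ToDisk, u { usedDisk = usedDisk u + partSize p })
  Nothing
    | usedRAM u + partSize p > maxRAM q -> Left "RAM quota exceeded"
    | otherwise -> Right (ToRAM, u { usedRAM = usedRAM u + partSize p })
```

A real worker cannot know the size of a part up front. It has to count bytes as they stream in and abort once a quota is crossed, but the bookkeeping is the same idea.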


In most cases, you do not need to write your own inputWorker. It is sufficient to use the defaultBodyPolicy:

> defaultBodyPolicy :: FilePath -> Int64 -> Int64 -> Int64 -> BodyPolicy

The first argument is the directory to store temporary files in, and the next three arguments are the quota values. I am not going to cover defaultBodyPolicy in detail in this post. But it is well documented in the Happstack Crash Course.


Improvements to RqData


The new RqData module also includes a number of new features.


There is now an Applicative functor instance for RqData. The applicative functor instance accumulates errors. This means if you try to look up multiple invalid keys, the error message will report all the missing values, not just the first one. This is nice when you are debugging your code, and is also nice if you provide a web service (REST API, etc) and want to provide your API users with detailed error messages instead of "Invalid Request".
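
To show what error accumulation means, here is a small stand-alone model. It is not the Happstack implementation: Lookup is a hypothetical stand-in for RqData that reads from a plain association list, and look, Login, and loginForm are illustrative names.

```haskell
-- Stand-in for RqData: a lookup over request parameters that either
-- fails with a list of messages or succeeds with a value.
newtype Lookup a = Lookup { runLookup :: [(String, String)] -> Either [String] a }

instance Functor Lookup where
  fmap f (Lookup g) = Lookup (fmap f . g)

instance Applicative Lookup where
  pure x = Lookup (const (Right x))
  Lookup ff <*> Lookup fx = Lookup $ \env ->
    case (ff env, fx env) of
      (Left e1, Left e2) -> Left (e1 ++ e2) -- keep both sets of errors
      (Left e1, Right _) -> Left e1
      (Right _, Left e2) -> Left e2
      (Right f, Right x) -> Right (f x)

-- Look up one key, reporting its name on failure.
look :: String -> Lookup String
look key = Lookup $ \env ->
  maybe (Left ["missing key: " ++ key]) Right (lookup key env)

data Login = Login String String
  deriving (Eq, Show)

loginForm :: Lookup Login
loginForm = Login <$> look "user" <*> look "password"
```

Running loginForm against an empty parameter list reports both missing keys. A monadic version written with do-notation would stop at the first failure, because the second lookup is never reached.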


We now provide two filters (body and queryString) which limit the scope of the look* functions to either the request body or the QUERY_STRING.
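
The idea behind the filters can be modelled as functions that hide part of the environment before a lookup runs. This is a sketch under assumptions, not the library code: Env and Lookup are hypothetical stand-ins, and the choice to search the query string before the body in the unfiltered look is an assumption of this example.

```haskell
-- Hypothetical environment: parsed QUERY_STRING and request body pairs.
data Env = Env
  { envQuery :: [(String, String)]
  , envBody  :: [(String, String)]
  }

newtype Lookup a = Lookup { runLookup :: Env -> Maybe a }

-- Unfiltered lookup: searches the query string, then the body (assumed order).
look :: String -> Lookup String
look key = Lookup $ \env -> lookup key (envQuery env ++ envBody env)

-- Restrict a lookup to the request body by hiding the query string.
body :: Lookup a -> Lookup a
body (Lookup f) = Lookup $ \env -> f env { envQuery = [] }

-- Restrict a lookup to the QUERY_STRING by hiding the body.
queryString :: Lookup a -> Lookup a
queryString (Lookup f) = Lookup $ \env -> f env { envBody = [] }
```

With this shape, body (look "id") ignores an id passed in the URL, which is useful when a value must come from the submitted form.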


A new function lookFile is provided to assist with handling file uploads.


A new function checkRq is provided to help you convert request parameters to Haskell types, or to check that a value meets some conditions.
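
The shape of such a check can be sketched over a plain Either. The checkE function below is a hypothetical stand-in, not checkRq itself: it assumes the check is a function from the looked-up value to Either String b, and that a failed check becomes one more error message. The ageCheck example is likewise illustrative.

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for checkRq over a plain Either.
-- An earlier failure is passed through; a failed check adds its message.
checkE :: Either [String] a -> (a -> Either String b) -> Either [String] b
checkE (Left errs) _ = Left errs
checkE (Right a) f   = either (\e -> Left [e]) Right (f a)

-- Convert a parameter to an Int and require it to be in a plausible range.
ageCheck :: String -> Either String Int
ageCheck s = case readMaybe s of
  Nothing -> Left ("not a number: " ++ s)
  Just n
    | n < 0 || n > 150 -> Left ("age out of range: " ++ show n)
    | otherwise        -> Right n
```

The same function covers both uses mentioned above: converting a String into a typed value, and rejecting values that parse but do not meet a condition.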


Summary


This post gives some of the background on the changes to how we handle the request body and form data. To actually see what the changes look like in practice, you should check out the RqData section in the Happstack Crash Course. It gives detailed examples of all the features and changes I talked about in this post. I have also updated the haddock documentation in darcs.


I would love to hear your opinions. Do you love the changes? Hate the changes? Have better ideas about how to solve the problems? In terms of handling the raw request body, I believe both Yesod and Snap use the same basic approach -- the first handler to try to use the request body gets the whole thing, and everyone else gets nothing. (And they provide ways to put the request body back if you want to.)
