Monday, July 19, 2010

Changes to request body and RqData in head

I have just pushed some patches which change the way the Request body and RqData are handled in Happstack 0.6. These are user-visible changes which will affect you if you:



  • Use RqData

  • Directly use the rqBody field in Request

  • Directly use the rqInput field in Request

  • Directly work with the Input type

  • Allow file uploads


Some of the changes fix bugs (design flaws), and others are for new features and functionality. The non-compatible API changes are pretty small, so it should be easy to port your code. It basically comes down to:



  1. getDataFn, withDataFn, etc take an extra argument of the type BodyPolicy

  2. getDataFn, withDataFn, etc return Either [String] a instead of Maybe a

  3. the inputValue field of the Input type is now Either FilePath L.ByteString instead of L.ByteString

  4. you have to explicitly import the module Happstack.Server.RqData
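
To make the porting effort concrete, here is a rough sketch (not a verbatim transcript of either API) of looking up a "msg" parameter before and after. Here myPolicy is a BodyPolicy value; constructing one with defaultBodyPolicy is shown in the BodyPolicy section below.

> -- Happstack 0.5 (roughly): getDataFn returns Maybe
> msgPart :: ServerPart Response
> msgPart =
>     do mMsg <- getDataFn (look "msg")
>        case mMsg of
>          Nothing    -> badRequest (toResponse "missing msg parameter")
>          (Just msg) -> ok (toResponse msg)
>
> -- Happstack 0.6 (roughly): getDataFn takes a BodyPolicy and returns Either.
> -- Note the explicit 'import Happstack.Server.RqData' now needed for look, etc.
> msgPart' :: ServerPart Response
> msgPart' =
>     do eMsg <- getDataFn myPolicy (look "msg")
>        case eMsg of
>          (Left errs) -> badRequest (toResponse (unlines errs))
>          (Right msg) -> ok (toResponse msg)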


In this post I will describe what motivated these changes. I am also hoping to get feedback on these changes before we release 0.6, since it will be less painful to make further changes now.


the Request body and space usage


In the old code the Request type stores the request body as a simple lazy ByteString:


> newtype RqBody = Body { unBody :: L.ByteString } deriving (Read, Show, Typeable)
>
> data Request = Request { ...
>                        , rqBody :: RqBody
>                        }

This feels nice, because it is a simple, pure value. Unfortunately, it is really not a great idea in practice. The request body does not initially require any space, because it is an unevaluated lazy ByteString. But the ServerPart holds the Request in its environment, and that means the garbage collector cannot free the RqBody as you evaluate it. If the request body contained gigabytes of data, that could be disastrous.


The solution in Happstack 0.6 is to use an MVar to hold the request body:


> data Request = Request { ...
>                        , rqBody :: MVar RqBody
>                        }

Instead of using rqBody directly, it is better to use takeRequestBody, so that your code will not break if we switch to IORef or something else.


> takeRequestBody :: Request -> IO (Maybe RqBody)
> takeRequestBody rq = tryTakeMVar (rqBody rq)

Now, when you process the RqBody the Request is no longer holding onto it, so the garbage collector can free it (assuming your code does not hold onto it and introduce a new space leak).
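
As a rough sketch (assuming the takeRequestBody signature above, the usual Happstack.Server helpers, and liftIO from Control.Monad.Trans), a handler that consumes the raw body might look like:

> -- respond with the number of bytes in the request body
> bodyLength :: ServerPart Response
> bodyLength =
>     do rq    <- askRq
>        mBody <- liftIO $ takeRequestBody rq
>        case mBody of
>          Nothing   -> internalServerError $ toResponse "request body already consumed"
>          (Just bd) -> ok $ toResponse (show (L.length (unBody bd)))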


This does have a drawback, however. A ServerPart can call mzero at any time, and processing will move on to the next ServerPart. But if you have already taken the RqBody, then the next ServerPart may be missing critical data it needs. On the other hand, if we left the RqBody intact, that would result in a space leak. I think that in practice, if a ServerPart made enough progress that it started consuming the RqBody and then failed, it is unlikely that another ServerPart would succeed and need the RqBody. If another ServerPart succeeds, it is probably just a 404 Not Found handler or something similar, which does not need the request body. So it seems better to have the default behavior be the more space-friendly solution.


We will also provide peekRequestBody and/or putRequestBody functions so that you can opt to leave the request body intact. It is up to you to be sensible about using them.


BodyInput and space usage


In RqData, the cookies, QUERY_STRING, and request body (when appropriate) are parsed into a [(String, Input)], where String is the name of the key, and Input is the value.


In Happstack 0.6, Input will be the type:


> data Input = Input
>     { inputValue       :: Either FilePath L.ByteString
>     , inputFilename    :: Maybe FilePath
>     , inputContentType :: ContentType
>     } deriving (Show, Read, Typeable)

In Happstack 0.5 the inputValue is simply an L.ByteString. Once again, this seems fine at first. After all, the inputValues are lazy ByteStrings, so we can process them lazily, right? Well, not quite. In the unprocessed request body, the key/value pairs are laid out like this:



key1
value1
key2
value2
key3
value3
key4
value4
...

If we were to consume the key/value pairs in a sequential manner, then we would be fine. But, generally we want to use functions which can look up a specific key. Imagine we want to look up key4. In order to do that we have to first read in all the preceding key/value pairs. If we knew we only cared about key4 then we could just toss the rest. But with the monadic RqData code we don't know that. (A future post will talk about an arrow-based alternative where we do know that.) So, we have to store all the key/value pairs in case we want to look up key1 after key4.


In Happstack 0.5, we store all those values in RAM. But some of those values might be (huge) files. That clearly isn't going to work. So we once again trade off a bit of simplicity/elegance for the practical matter of not having unlimited amounts of RAM. Instead, we store some values in RAM and some values on the disk. How do we decide what goes where? That brings us to BodyPolicy.


BodyPolicy


When parsing the request body, we need some way to decide what values should be stored in RAM and what values should be saved to disk. Additionally, we want to impose limits on how much data can be stored in either location. If a user decides to post the contents of /dev/random you are likely to want to cut them off at some point. However, the specific values for the quotas are application specific. In fact, they may be specific to the particular form that is being processed. For example, an admin user might have higher quotas than a regular user.


The answers to these questions are provided by the BodyPolicy, which looks like:


> data BodyPolicy
>     = BodyPolicy { inputWorker :: Int64 -> Int64 -> Int64 -> InputWorker
>                  , maxDisk     :: Int64 -- ^ maximum bytes to save to disk (files)
>                  , maxRAM      :: Int64 -- ^ maximum bytes to hold in RAM
>                  , maxHeader   :: Int64 -- ^ maximum header size (this only affects headers in the multipart/form-data)
>                  }

The inputWorker is the function that actually decides where values should be saved, and it implements the quotas. Its Int64 arguments are the quotas for the disk, the RAM, and the headers, which don't really get saved, but which can temporarily take up space. The next three fields are the values to pass to the inputWorker.


In most cases, you do not need to write your own inputWorker. It is sufficient to use the defaultBodyPolicy:

> defaultBodyPolicy :: FilePath -> Int64 -> Int64 -> Int64 -> BodyPolicy

The first argument is the directory to store temporary files in, and the next three arguments are the quota values. I am not going to cover defaultBodyPolicy in detail in this post. But it is well documented in the Happstack Crash Course.
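
For example, here is a rough sketch of a policy (assuming the three quotas follow the field order above: disk, RAM, header) that stores temporary files under /tmp, allows at most 1MB of file data on disk, 4KB of form data in RAM, and 4KB of multipart headers:

> myPolicy :: BodyPolicy
> myPolicy = defaultBodyPolicy "/tmp/" 1000000 4096 4096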


Improvements to RqData


The new RqData module also includes a number of new features.


There is now an Applicative instance for RqData. The Applicative instance accumulates errors. This means that if you try to look up multiple invalid keys, the error message will report all the missing values, not just the first one. This is nice when you are debugging your code, and it is also nice if you provide a web service (REST API, etc.) and want to provide your API users with detailed error messages instead of "Invalid Request".
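
For example (a quick sketch; <$> and <*> come from Control.Applicative):

> -- if both "first_name" and "last_name" are missing, the error
> -- report mentions both keys, not just the first failure
> fullName :: RqData (String, String)
> fullName = (,) <$> look "first_name" <*> look "last_name"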


We now provide two filters (body and queryString) which limit the scope of the look* functions to either the request body or the QUERY_STRING.
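
A quick sketch of the scoping:

> -- only accept "username" from the request body; a "username"
> -- appearing in the query string will not match
> usernameRq :: RqData String
> usernameRq = body (look "username")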


A new function lookFile is provided to assist with handling file uploads.
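
A hedged sketch, assuming lookFile returns the location of the temporary file on disk, the original filename supplied by the browser, and the content-type of the upload (check the haddocks for the exact signature):

> fileUpload :: RqData (FilePath, FilePath, ContentType)
> fileUpload = lookFile "file_upload"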


A new function checkRq is provided to help you convert request parameters to Haskell types, or to check that a value meets some conditions.
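
For example, a rough sketch that parses the "count" parameter as an Int, with a descriptive error for values that do not parse:

> lookCount :: RqData Int
> lookCount = look "count" `checkRq` readInt
>   where
>     readInt :: String -> Either String Int
>     readInt str =
>         case reads str of
>           [(n, "")] -> Right n
>           _         -> Left ("could not parse as an Int: " ++ str)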


Summary


This post gives some of the background on the changes to how we handle the request body and form data. To actually see what the changes look like in practice, you should check out the RqData section in the Happstack Crash Course. It gives detailed examples of all the features and changes I talked about in this post. I have also updated the haddock documentation in darcs.


I would love to hear your opinions. Do you love the changes? Hate the changes? Have better ideas about how to solve the problems? In terms of handling the raw request body, I believe both Yesod and Snap use the same basic approach -- the first handler to try to use the request body gets the whole thing, and everyone else gets nothing. (And they provide ways to put the request body back if you want to.)

Sunday, July 11, 2010

sendfile 0.7.1

I have just uploaded sendfile 0.7.1 to hackage.

The sendfile library exposes zero-copy sendfile functionality in a portable way. If a platform does not support sendfile, a fallback implementation in Haskell is provided. It currently has zero-copy support for Linux, Darwin, FreeBSD, and Windows.

The sendfile functionality typically reduces CPU load and (possibly) increases IO throughput.

The new release of sendfile adds the ability to hook into the send loop. This is useful if you want to tickle timeouts or update a progress bar while the file is being sent.

This turned out to be rather tricky because each platform implements sendfile a little differently. But, the point of the sendfile library is to provide a unified interface so that other developers do not have to know any of the platform specific details.

The solution in 0.7.1 is to use a simple, specialized iteratee. Each pass of the sendfile loop can end in one of three states:

(1) the requested number of bytes for that iteration was sent successfully, and there are more bytes left to send.

(2) some (possibly 0) bytes were sent, but the file descriptor would now block if more bytes were written. There are more bytes left to send.

(3) all the bytes were sent, and there is nothing left to send.

We handle these three cases by using a type with three constructors:

data Iter
    = Sent       Int64 (IO Iter)
    | WouldBlock Int64 Fd (IO Iter)
    | Done       Int64

All three constructors provide an Int64 which represents the number of bytes sent for that particular iteration (not the total byte count).

The Sent and WouldBlock constructors provide IO Iter as their final argument. Running this IO action will send the next block of data.

The WouldBlock constructor also provides the Fd for the output socket. You should not send any more data until the Fd would not block. The easiest way to do that is to use threadWaitWrite to suspend the thread until the Fd is available.

A very simple function to drive the Iter might look like:

runIter :: IO Iter -> IO ()
runIter iter =
    do r <- iter
       case r of
         (Done _n)               -> return ()
         (Sent _n cont)          -> runIter cont
         (WouldBlock _n fd cont) ->
             do threadWaitWrite fd
                runIter cont

You would use it as the first argument to one of the *IterWith functions, e.g. with a 64KB block size:

sendFileIterWith runIter outputSocket "/path/to/file" (2^16)

(The parentheses around 2^16 are needed, since function application binds more tightly than ^.)

If we want to do something fancier, such as update timeouts or a progress bar, we can do it in a custom runIter function. If we are using a non-standard I/O manager, we might be able to suspend the thread via a call other than threadWaitWrite.
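
For instance, here is a rough sketch of a fancier driver that reports cumulative progress through a caller-supplied callback. The name progressIter is hypothetical, not part of the sendfile API; Iter, threadWaitWrite, and Int64 come from Network.Socket.SendFile, Control.Concurrent, and Data.Int respectively.

-- like runIter above, but call 'report' with the cumulative byte
-- count after every iteration of the send loop
progressIter :: (Int64 -> IO ()) -> IO Iter -> IO ()
progressIter report = go 0
  where
    go total iter =
        do r <- iter
           case r of
             (Done n)               -> report (total + n)
             (Sent n cont)          -> do report (total + n)
                                          go (total + n) cont
             (WouldBlock n fd cont) -> do report (total + n)
                                          threadWaitWrite fd
                                          go (total + n) cont

It plugs into sendFileIterWith in place of runIter above, e.g. sendFileIterWith (progressIter reportProgress) outputSocket "/path/to/file" (2^16), where reportProgress is whatever callback you supply.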

What Next?


The new version of sendfile will be used to improve the timeout handling in the Haskell web framework, Happstack.

It would be nice if the sendfile library could export a low-level function like:

sendfile :: Fd -> Fd -> Int64 -> Int64 -> IO (Bool, Int64)

It would take the output socket, the input file descriptor, an offset, and a length, and it would return the number of bytes written and whether the output socket blocked.

Unfortunately, it is not possible to provide a portable implementation of this sendfile function. That would require functions which can operate directly on the Fds. But those functions live in the unix package, which is not portable.

Another non-solution is to have a module like Network.Socket.SendFile.LowLevel, which is only exported on the platforms which provide a low-level sendfile implementation. However, it is my understanding that this is not really allowed by cabal policy, because there would be no way to specify that you require a version of the sendfile library that exports .LowLevel.

So, I believe a more correct solution is to create a *new* package, sendfile-lowlevel, which exports Network.Socket.SendFile.LowLevel. This assumes that there is some way to mark that a package is only available on certain platforms. However, I am not sure if that can be done.

Hopefully the new API provides enough flexibility that there is no need for an even lower-level API to be exposed. If you think you need something lower-level, let me know, and let's see if we can work something out.