Faster Paint Times

2016-08-23 · Topic: CSS

Quora's mission is to share and grow the world's knowledge, so it's critical we create a product that enables anyone to quickly read and contribute content across all platforms. To that end, the speed of our web and mobile products is one of our engineering team's top priorities. Studies have repeatedly found that faster pages lead to more engaged users, so it's important that we minimize the amount of time that our users are left at a blank page waiting for content to be displayed. Page loading times are affected by many different factors, ranging from the time taken by the server to the size of the response, and in this post, we'll focus specifically on one—the time needed for a page to paint.

Paint time represents the amount of time before elements on the page are rendered by the browser and appear to the end user. At Quora, we use a metric called Speed Index to track paint times. Rather than using the single timestamp at which the first paint is completed by the browser, Speed Index is defined as the average time at which visible parts of the page are displayed (measured in milliseconds); put another way, a Speed Index of n means that after n milliseconds, the average pixel on the page has been painted.
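To make the definition concrete, Speed Index can be approximated as the area above the visual-completeness curve: the integral of (1 − completeness) over time. Here is a minimal sketch; the `(timestamp_ms, completeness)` sample format is an assumption for illustration, not how any particular tool records frames.

```python
def speed_index(samples):
    """Approximate Speed Index from (timestamp_ms, visual_completeness)
    samples, where completeness runs from 0.0 (blank) to 1.0 (fully
    painted). Speed Index is the area above the completeness curve:
    the integral of (1 - completeness) over time, in milliseconds."""
    total = 0.0
    for (t0, vc0), (t1, _) in zip(samples, samples[1:]):
        # Left Riemann sum: treat completeness as vc0 over [t0, t1).
        total += (1.0 - vc0) * (t1 - t0)
    return total

# A page that is blank until 500 ms, 80% painted at 500 ms, and fully
# painted at 1000 ms, has a Speed Index of about 600:
frames = [(0, 0.0), (500, 0.8), (1000, 1.0)]
```

A page that paints most of its pixels early scores a lower (better) Speed Index than one that stays blank and then paints everything at once, even if both finish at the same time.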

To optimize our paint times, we take the following approach:

  1. Start sending bytes as soon as possible
  2. Send as few render blocking bytes as possible
  3. Parallelize as much as possible

Together, these three techniques have improved the Speed Index of our pages by over 20%. In this post, we'll discuss each of them in detail.

Start sending bytes as soon as possible

In a typical HTTP response, the server buffers the full content of the response before sending any data to the client. However, with chunked transfer encoding (first introduced in HTTP 1.1), the server can break up a single response into multiple chunks of data, which means the server can send data to the client without buffering the entire response. Chunked encoding thus lends itself well to dynamic sites, as the server can progressively send parts of a page to the client as soon as they're ready. So, with chunked encoding, pages are not bottlenecked by their slowest components (since they can be sent via independent chunks), which means users will see content more quickly. Facebook, for instance, takes this approach with their BigPipe framework.
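At the wire level, chunked transfer encoding frames each part of the body with its size, so the client can consume parts as they arrive. The following sketch shows the framing defined by HTTP/1.1 (RFC 7230, section 4.1); in practice the web server or framework does this for you.

```python
def encode_chunked(parts):
    """Frame response parts using HTTP/1.1 chunked transfer encoding:
    each chunk is its size in hex, CRLF, the data, CRLF; a zero-length
    chunk terminates the stream."""
    out = b""
    for part in parts:
        data = part.encode("utf-8")
        out += b"%x\r\n" % len(data) + data + b"\r\n"
    return out + b"0\r\n\r\n"

# Two parts of a page sent as two independent chunks:
wire = encode_chunked(["<html><head>...</head>", "<body>...</body></html>"])
```

Because each chunk carries its own length, the server never needs to know the total response size up front—it can compute and send the page piece by piece.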

At Quora, our pages are constructed using building blocks that look like the following:

# Render method of an example piece of UI.
def render():
    subtree = []
    with h.Div().into(subtree):
        with h.Span().into(subtree):
            subtree += h.text('Hello World')
    return subtree

# The render method returns a list of pairs, each consisting of a
# directive and a UI element.
subtree = [
    ('topen', Div()),
    ('topen', Span()),
    ('text', 'Hello World'),
    ('tclose', Span()),
    ('tclose', Div()),
]

# Each directive tells the renderer how to transform the UI element
# into an HTML string.
<div><span>Hello World</span></div>
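The renderer that turns this directive list into HTML can be sketched as a simple loop over the pairs. The `Div`/`Span` classes below are minimal stand-ins for the real UI element classes, which this post doesn't show.

```python
from html import escape

# Minimal stand-ins for the real UI element classes (assumption).
class Div:
    tag = 'div'

class Span:
    tag = 'span'

def render_to_html(subtree):
    """Consume (directive, element) pairs and emit an HTML string."""
    out = []
    for directive, element in subtree:
        if directive == 'topen':
            out.append('<%s>' % element.tag)
        elif directive == 'tclose':
            out.append('</%s>' % element.tag)
        elif directive == 'text':
            out.append(escape(element))
    return ''.join(out)

subtree = [
    ('topen', Div()),
    ('topen', Span()),
    ('text', 'Hello World'),
    ('tclose', Span()),
    ('tclose', Div()),
]
print(render_to_html(subtree))  # <div><span>Hello World</span></div>
```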

Taking this a step further, we group logical UI pieces into component objects:

class FeedStoryItem(object):
    def render(self):
        ...  # builds and returns this story's subtree
class Feed(object):
    def render(self):
        subtree = []
        with h.Div(class_=['wrapper']).into(subtree):
            subtree += FeedStoryItem(answer_id=5)
            subtree += FeedStoryItem(answer_id=6)
        return subtree

As you can see, each component is responsible for defining its own subtree. So, to render a page, we simply recursively render the root component on the page.

for directive, data in page.render():
    # Render data (and recurse if directive is component)
    yield data.render()

With this setup, we can flush the content of each component to the client as soon as it finishes rendering. This allows us to send parts of the page to the user as soon as possible, without being blocked on a particular slow component on the page. As long as components above the fold are rendered quickly, users can see a nearly complete page render without having to wait for all components to be done rendering.
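The flush-as-you-go behavior maps naturally onto a generator-based response, as supported by WSGI-style servers: each yielded string becomes its own chunk on the wire. The `render_html()` method and the `Header`/`Feed` classes below are illustrative assumptions, not Quora's actual component API.

```python
def stream_page(components):
    """Yield each component's HTML as soon as it is rendered. A server
    that supports streaming responses sends each yielded string to the
    client as its own chunk, so a slow component only delays the
    content that comes after it."""
    yield "<html><body>"
    for component in components:
        yield component.render_html()  # flushed to the client immediately
    yield "</body></html>"

# Illustrative components (assumption):
class Header:
    def render_html(self):
        return "<header>...</header>"

class Feed:
    def render_html(self):
        return '<div class="feed">...</div>'

chunks = list(stream_page([Header(), Feed()]))
```

If `Header` renders quickly, its chunk reaches the browser long before the (typically slower) `Feed` chunk, which is exactly the progressive paint shown in the filmstrips below.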

The following two filmstrips demonstrate this change in action.

In the above timeline, the entire page is buffered on the server and flushed once at 1.07s. So, before that single flush, the user is left looking at a blank page.

In this second timeline, the page is rendered and sent in chunks. As you can see, the page is painted progressively, which means the user can see content (i.e., the header and sidebar) much earlier, at 659ms. Even though the feed portion of the page is displayed at around the same time across both loads, the user is able to see parts of the page sooner in the second scenario. So, the Speed Index of the second example is lower, as the average pixel is visually complete sooner than the first example.

Send as few render blocking bytes as possible

Simply serving pages in chunks doesn't necessarily guarantee that users will see content sooner, as the page's first paint is also blocked by CSS. In order to prevent a jarring flash of unstyled content, browsers don't render any content until the page's CSS has been fully downloaded and parsed. So, even when the HTML of a page (or of a chunk) has been downloaded, users will still see a blank screen until the CSS finishes downloading as well. The browser's navigation timing, resource timing, and paint timing APIs contain more information about what specifically blocks each page's firstPaint event (and how often).

A common solution to this problem is inlining a page's critical CSS, or the minimum set of CSS rules for that page. Rather than forcing the browser to download the entire CSS payload—which may contain CSS rules that aren't used on the page or rules that apply to elements that are initially hidden—before rendering anything, we can inline the necessary CSS rules directly on the page, which means the browser doesn't need to block rendering on an external stylesheet. Later, the browser can fetch the stylesheet so it can be cached for subsequent requests.
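Put together, the page's head ends up with an inline `<style>` block plus a deferred link to the full stylesheet. One common deferral trick is the `media="print"` swap shown below; `rel="preload"` is another. This is a generic sketch of the pattern, not Quora's implementation, and all names are illustrative.

```python
def build_head(critical_rules, stylesheet_url):
    """Build a page <head> that inlines the critical CSS rules and
    defers the full stylesheet: the link loads with media="print"
    (non-render-blocking) and switches to media="all" once fetched."""
    inline = "<style>%s</style>" % "".join(critical_rules)
    deferred = ('<link rel="stylesheet" href="%s" '
                'media="print" onload="this.media=\'all\'">') % stylesheet_url
    return "<head>%s%s</head>" % (inline, deferred)

head = build_head([".link{color:blue}"], "/static/main.css")
```

The browser can paint as soon as it has the inline rules, and the full stylesheet lands in the cache for subsequent navigations.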

For a dynamic site, the critical CSS for a page might be different with each request. Consider the FeedStoryItem component we defined above—not only does it define an HTML subtree, but it also has an associated set of CSS rules. Since this component is reusable, a developer should be able to easily add it to a new page, at which point its CSS becomes part of the critical CSS for that page. Furthermore, any component may be conditionally rendered on a page, so its associated CSS may or may not be part of the page's critical CSS.

In order to dynamically determine a page's critical CSS in the context of a web request, we created a tool called ParseCSS, which we're open sourcing today. Whereas many other critical CSS tools are designed for mostly static sites—taking as input an already-generated HTML file along with its CSS—ParseCSS is specifically designed for dynamic pages, where the critical CSS might change with each request.

ParseCSS is a standalone command line tool written using node.js and the popular postcss library. Given a CSS file, ParseCSS outputs a JSON representation of the CSS AST, so developers don't have to deal with the AST directly. Using this output, we can determine in our rendering code which styles should be inlined in each response.

The output of ParseCSS looks like the below:

{
    "globalCss": [
        "html, body {...}"
    ],
    "keyframesCss": [
        ["fadeIn", "@keyframes fadeIn {...}"]
    ],
    "fontfaceCss": [
        "@font-face {...}"
    ],
    "classListCssPairs": [
        [["link"], ".link {...}"],
        [["hidden", "link"], ".hidden.link {...}"]
    ]
}

Using this output, we can create an in-memory representation of our production CSS using a trie. Then, we collect the list of CSS classes used on the page (which is often dynamic), as specified in components like this:

h.Div(class_=['link', 'primary'])

Finally, given a list of CSS classes, we can efficiently query the in-memory trie in order to get the page's critical CSS, which we then inline on the page. Now, each chunk contains its critical CSS inline, which means it can be painted by the browser immediately, rather than waiting for an external stylesheet to download. Taking this a step further, we can also de-duplicate CSS rules, so subsequent chunks don't include rules that have already been inlined by previous chunks. This approach ensures that our progressive rendering happens as quickly as possible.
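The lookup-plus-de-duplication step can be sketched as follows. The real implementation queries a trie keyed on class lists; a dict of frozensets is a simpler stand-in with the same observable behavior, and the `classListCssPairs` shape mirrors ParseCSS's output above.

```python
class CriticalCssCollector:
    """Collect the CSS rules needed for the classes used on a page,
    de-duplicating across chunks so later chunks never re-inline a rule
    an earlier chunk already sent. (Sketch: the production version uses
    a trie rather than a linear scan over frozensets.)"""

    def __init__(self, class_list_css_pairs):
        # class_list_css_pairs: list of ([class, ...], rule_text) pairs,
        # as in ParseCSS's "classListCssPairs" output.
        self.rules = [(frozenset(cl), css) for cl, css in class_list_css_pairs]
        self.emitted = set()

    def css_for_chunk(self, classes_used):
        classes_used = set(classes_used)
        out = []
        for needed, css in self.rules:
            # A rule applies if every class in its selector appears in
            # this chunk, and it hasn't already been inlined.
            if needed <= classes_used and css not in self.emitted:
                self.emitted.add(css)
                out.append(css)
        return "".join(out)

pairs = [(["link"], ".link {...}"),
         (["hidden", "link"], ".hidden.link {...}")]
collector = CriticalCssCollector(pairs)
first = collector.css_for_chunk(["link", "primary"])   # ".link {...}"
second = collector.css_for_chunk(["link"])             # "" (de-duplicated)
```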

Parallelize as much as possible

We can leverage parallelism on both the server and the client to optimize our chunk-based rendering even further.

On the server, we've long rendered subtrees of the DOM in parallel. With chunked encoding, we updated our parallel rendering framework to render chunks in parallel with each other, then send each completed chunk to the client as soon as its render has completed. Since chunks can be rendered independently, the page is no longer bottlenecked by its slowest chunk.

Returning to our earlier example, rendering chunks in parallel looks like this:

subtree = [
    FeedStoryItemPromise(answer_id=5),  # previously FeedStoryItem(answer_id=5)
    FeedStoryItemPromise(answer_id=6),  # previously FeedStoryItem(answer_id=6)
]

for directive, data in page.render():
    if isinstance(data, Promise):
        # Blocks until this particular chunk is ready.
        # Other parallelized subtrees are concurrently being
        # rendered too! (wait() is illustrative.)
        data = data.wait()
    yield data.render()
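The same "render concurrently, flush in page order" idea can be sketched with Python's standard `concurrent.futures`; this is an illustrative stand-in, not Quora's actual framework.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def render_chunks_in_parallel(chunk_fns):
    """Render chunks concurrently but yield them in page order, so each
    chunk is flushed as soon as it (and everything before it) is done."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in chunk_fns]
        for future in futures:
            # Blocks only until *this* chunk is ready; later chunks
            # keep rendering in the background.
            yield future.result()

# Illustrative chunk render functions (assumption):
def render_header():
    return "<header>...</header>"

def render_feed():
    time.sleep(0.05)  # simulate a slow component
    return "<div>feed</div>"

chunks = list(render_chunks_in_parallel([render_header, render_feed]))
```

Here the header chunk is yielded immediately while the feed is still rendering; a slow chunk delays only the chunks that must appear after it.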

On the client, we can leverage parallelism to improve the efficiency of resource downloads. In a naïve page load, the following steps happen in series:

  1. User requests a page
  2. Page is constructed server-side
  3. Page is sent to client
  4. Browser parses the page
  5. Browser downloads CSS, JS, and fonts in the page head
  6. Browser renders the page
  7. User sees the page

Each of these steps has a significant amount of idle time, either on the server, the network, or the client. With chunked encoding, we can introduce more parallelism into resource downloading to eliminate much of this idle time.

In a typical HTML page, the head of the page contains links to other resources required on that page, including CSS, JavaScript, fonts, and images. For websites with large assets, downloading these files can take a significant amount of time, which in turn increases the amount of time before the page is painted or interactive. The following waterfall demonstrates this effect, where the response payload and corresponding assets are not downloaded in parallel:

In order to improve the parallelism of these resource downloads, we can send the head of the page in a separate chunk that's received by the client first. As soon as the browser receives the head chunk, it can start downloading CSS, JavaScript, fonts, and images before the entire page is downloaded (i.e., the blue bar). In doing so, we've parallelized the fetching of these sub-resources by the client and the page construction by the server.

The below waterfall shows this parallelization improvement in action:

By sending the head chunk first, the browser was able to initiate downloads of other assets before the page was completely downloaded.

Closing thoughts

In this post, we covered our approach to improving paint times. To summarize:

  • We made changes to allow our servers to respond with parts of the page as soon as they were ready using chunked encoding.
  • We changed our critical CSS implementation to support chunked responses.
  • We integrated these changes with our parallel rendering framework to further leverage parallelism.

Since we started rolling out these changes, we've seen the Speed Index of our pages improve by about 20%. We've also seen page load times improve by about 5% from the improvements to pipelining. There's a lot we can do to make Quora faster, and more projects to improve the performance of our web and mobile products are already underway. If you're interested in working on performance at Quora, check out the open engineering roles on our careers page!

