Graylog2 and its MongoDB usage

Centralized logging is definitely not a new problem. syslogd has been in use for ages, receiving your loosely structured data over a loosely defined protocol. Then newer implementations appeared, most notably syslog-ng and rsyslog. Then cloud services such as Splunk and Loggly arrived. Even the syslog protocol itself is specified more precisely now. But some pieces are still missing: a simple but powerful web interface, graphs and notifications, installation behind your own firewall, and support for logging and searching structured data from applications. Of all of those, the last one matters most to me. I want to find all events related to the post with id 42, not just all messages that happen to contain "post" and "42" somewhere in their text. Until recently there were no such tools, so I started developing my own… and then I heard about Graylog2.

Graylog2 is a solution for centralized collection and processing of log messages in syslog (both legacy and standardized formats) and in its own GELF format. It consists of a collecting server and a web application for viewing and searching the collected data. It was already quite powerful when I discovered it, and the idea behind GELF is very close to my own thoughts, so I ditched my reinvention of the wheel and started contributing to Graylog2, where I’m working to make it round. :)

But what is GELF anyway? The Graylog Extended Log Format is a documented specification for log messages and other application data with straightforward support for structured data elements. A GELF message is just a JSON object with name/value pairs, compressed and sent to the server over UDP or TCP. How should we store it? We can’t just keep it in memory – we need persistent storage. We can’t easily do it in a typical database with a rigid schema – implementing search by arbitrary additional fields would require a lot of work. (If you are an archaeologist, you may want to dig into the SQL-powered Graylog1 sources.) Implementing it on top of a low-level key/value store like Redis would require a similar amount of work – you would have to build your own indexes. (Redis is great, by the way, but that’s a topic for another post.) So we need something in between: schema-less storage with indexes; everything else is a bonus. With MongoDB we get a lot of bonuses:

  • capped collections, which preserve insertion order and enforce a maximum collection size, automatically removing the oldest messages;
  • tailable cursors, which are used to implement tail -f behavior (see the sketch right after this list);
  • awesome performance;
  • and an easy API.
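
For the first two items, here is a minimal sketch in the mongo shell. The collection name and the 100 MB cap are made up for illustration and are not Graylog2’s actual settings:

> // capped collection: fixed maximum size, insertion order preserved,
> // oldest documents dropped automatically once the limit is reached
> db.createCollection("messages", { capped: true, size: 100 * 1024 * 1024 })

> // tailable cursor: keeps returning new documents as they arrive,
> // much like tail -f; a real consumer would re-query or add
> // DBQuery.Option.awaitData once the cursor runs out of data
> var cursor = db.messages.find().addOption(DBQuery.Option.tailable)
> while (cursor.hasNext()) { printjson(cursor.next()) }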

All those features allow us to concentrate on our work on the server and the web app, thinking about the backend only occasionally. Let me give you a concrete example of a situation when it does come up. For this I grabbed some messages from a deployed instance:

> db.messages.dataSize()
129779772   # 120 MB

> db.messages.count()
210728

> db.messages.getIndexes()
[
    {
        "name" : "_id_",
        "ns" : "graylog2.messages",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    }
]

> db.messages.isCapped()
false

I imported the messages into an uncapped collection without any custom indexes. Now I want to profile a specific query.

> db.setProfilingLevel(2)

> db.messages.find({'_user_id': '163684'})
(some data)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T12:35:32.397Z"),
  "info" : "query graylog2.messages
            reslen:19731 nscanned:150594
            query: { _user_id: \"163684\" }
            nreturned:101 bytes:19715 218ms",
  "millis" : 218 }

The value of the nscanned field indicates the number of scanned records. Since I don’t have an index, Mongo performs a full table scan. Let’s create an index then:

> db.messages.ensureIndex({'_user_id': 1})

> db.messages.find({'_user_id': '163684'})
(some data)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T12:57:49.922Z"),
  "info" : "query graylog2.messages
            reslen:19731 nscanned:102
            query: { _user_id: \"163684\" }
            nreturned:101 bytes:19715",
  "millis" : 25 }

No magic there, just the kind of indexes you know from any SQL database. But the _user_id field is not defined in any schema; in fact, I have it only because the actual GELF messages I sent to the server contained it.
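
For illustration, a GELF message carrying such a field might look roughly like this (a hand-written example with an abbreviated field set and invented values; the underscore prefix marks additional, application-defined fields):

{
  "version": "1.0",
  "host": "web-01",
  "short_message": "Post 42 updated",
  "timestamp": 1304166932,
  "level": 6,
  "facility": "myapp",
  "_user_id": "163684",
  "_post_id": "42"
}

On the wire this object is compressed and sent over UDP, and on the server every field, including the underscore-prefixed ones, simply becomes a field of the stored Mongo document.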

Here is another example, a query for Graylog2’s main page, showing recent messages:

> db.messages.find().sort({'_id': -1}).limit(100)
(some data)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T13:10:09.843Z"),
  "info" : "query graylog2.messages
            ntoreturn:100 reslen:78537 nscanned:101
            query: { query: {}, orderby: { _id: -1.0 } }
            nreturned:100 bytes:78521",
  "millis" : 0 }

This query is efficient because I have an index on the _id field. Using it for sorting makes sense because generated ObjectIds begin with the current timestamp, so ordering by _id approximates insertion order. Let’s convert this collection to a capped one to drop old messages automatically:

> db.messages.convertToCapped(db.messages.totalSize() * 3)
{ "ok" : 1 }

> db.messages.find().sort({'_id': -1}).limit(100)
(takes some time!)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T13:15:30.285Z"),
  "info" : "query graylog2.messages
            ntoreturn:100 scanAndOrder  reslen:78537 nscanned:210728
            query: { query: {}, orderby: { _id: -1.0 } }
            nreturned:100 bytes:78521 958ms",
  "millis" : 958 }

Whoa! What happened? It’s a full table scan.

> db.messages.getIndexes()
[ ]

Converting the collection to a capped one dropped our index on _id. We don’t really need it for this particular query, since Mongo preserves insertion order when returning data from a capped collection:

> db.messages.find().sort({$natural: -1}).limit(100)
(some data)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T14:20:49.060Z"),
  "info" : "query graylog2.messages
            ntoreturn:100 reslen:78537 nscanned:101
            query: { query: {}, orderby: { $natural: -1.0 } }
            nreturned:100 bytes:78521",
  "millis" : 0 }

Much better. Let’s combine our examples:

> db.messages.ensureIndex({'_user_id': 1})

> db.messages.find({'_user_id': '163684'}).sort({$natural: -1}).limit(100)
(takes some time :( )

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T14:27:13.639Z"),
  "info" : "query graylog2.messages
            ntoreturn:100 reslen:19536 nscanned:174758
            query: { query: { _user_id: \"163684\" },
                            orderby: { $natural: -1.0 } }
            nreturned:100 bytes:19520 235ms",
  "millis" : 235 }

It’s an (almost) full table scan again. Let’s create an index on _id and try to use it.

> db.messages.ensureIndex({'_id': 1})

> db.messages.find({'_user_id': '163684'}).sort({'_id': -1}).limit(100)
(some data)

> db.system.profile.find()
{ "ts" : ISODate("2011-04-30T14:34:18.513Z"),
  "info" : "query graylog2.messages
            ntoreturn:100 scanAndOrder  reslen:19536 nscanned:106
            query: { query: { _user_id: \"163684\" },
                            orderby: { _id: -1.0 } }
            nreturned:100 bytes:19520",
  "millis" : 1 }

So what are the results? If your query on a capped collection has any condition and you want the results sorted by insertion order, you still need an index on _id. At the same time, natural order is not preserved for an uncapped collection. So in Graylog2 we use a capped collection with an index on _id.
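
Putting it all together, a minimal sketch of that setup in the mongo shell might look like this (the collection size and the extra indexed field are illustrative; the actual Graylog2 server sets up its collection itself):

> // capped collection: bounded size, old messages dropped automatically
> db.createCollection("messages", { capped: true, size: 100 * 1024 * 1024 })

> // index on _id for "filter + newest first" queries,
> // plus one index per additional field we want to search by
> db.messages.ensureIndex({ '_id': 1 })
> db.messages.ensureIndex({ '_user_id': 1 })

> // recent messages of one user, newest first
> db.messages.find({ '_user_id': '163684' }).sort({ '_id': -1 }).limit(100)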

But this rather simple situation is really the hardest MongoDB-related problem we have ever met. The built-in profiler and the explain() mechanism, which is not covered here, are great helpers for us. Given all the cool features I mentioned at the beginning of this post, Mongo is a great backend for us – fast, powerful and simple at the same time.

Filed under graylog2 mongo nosql
