HouseWatch: Open-source tool for monitoring and managing ClickHouse clusters

craigching · on June 17, 2023

Integrate this into Grafana as an app plugin and you’d have me. I don’t want to leave Grafana where I have all my other operational dashboards for this.

cplli · on June 17, 2023

Now we need a tool called NeighbourhoodWatch to monitor the cluster monitors.

donutshop · on June 17, 2023

NeighborhoodWatch for US based resources

ram_rar · on June 17, 2023

Love the tool, but it's not practical in the enterprise world to have yet another dashboard service to look at just for metrics. It would be great if this played well with Grafana or OTel collectors.

OTOH, monitoring long-running background jobs on a CH cluster is very valuable to have. It's a real pain to verify whether parent and child queries have executed correctly. I would suggest doubling down on features that users cannot readily get via Grafana or OTel.

nightpool · on June 17, 2023

"not practical" for who? If you need to debug your clickhouse clusters, you look at the clickhouse tool. That's it. This isn't an alerting/monitoring solution, it's a specialized tool for debugging and fixing issues with running clusters.

that kind of thinking (that it's too hard to learn a second tool) is how datadog gets away with charging $$$$ for mediocre versions of 10 different products that cost an order of magnitude more than they would individually. the benefits you get from combining everything into one tool are vastly overstated compared to the benefits you get from having the in-house expertise to use the right tool for the job.

JOnAgain · on June 17, 2023

Readme could link to or explain what clickhouse is, for those of us who might not know.

mdekkers · on June 17, 2023

Clickhouse is a really cool and stupidly fast columnar database

schoolornot · on June 17, 2023

I understand why OLAP writes are faster but is there any reason why OLTPs can't achieve similar read performance with denormalized and sharded data?

rscrawfo · on June 18, 2023

Aggregation is a huge reason. For rolling up data, something like ClickHouse can't be beat by OLTP.

tempest_ · on June 17, 2023

What problems can I solve with a columnar database?

What type of data benefits from that type of Database?

Exuma · on June 17, 2023

Imagine you have a small business that tracks on the order of 10's - 100's of millions of events (pageviews, clicks, whatever), and you have reporting you want to run. Trying to do this in PG/MySQL would likely require materialized views so your reports don't take a long time to run. You could store your event data in CH directly, or use an ELT/ETL process to sync/copy it into ClickHouse just for reporting. Then your queries would be very fast. It's much faster (for certain types of queries, mainly timeseries queries or queries involving aggregation of many rows). It's faster because of how the data is stored on disk. It's NOT good for fetching/updating/deleting single rows, however.

It's originally designed to handle hundreds of columns, and billions of rows, but I think it can still apply to much smaller use cases that value performance. I'm implementing it currently in a similar scenario, and I'm using AirByte OSS version to ELT from postgres. Then I'm using tableau or some other BI tool to analyze that data much more effectively (I will be trying to perform complex aggregations/group by reports on 100mm rows)

datatrashfire · on June 17, 2023

Row-based databases are optimized for accessing complete rows and joins. Columnar storage is optimized for accessing all, or many, column values across rows. This makes aggregates and applying transformation logic faster with columnar storage than with row-based storage. I.e., they are great for data warehouses and other analytical workloads.

Ps, great and still highly relevant resource covering all the major database system designs, their advantages and drawbacks: https://www.oreilly.com/library/view/designing-data-intensiv...

FridgeSeal · on June 17, 2023

Less about the data itself and more about the specific operations you want to do on it.

Large aggregations, massive datasets, large joins, and workloads that are read heavy and eschew row-level mutations.

They get used for data analysis frequently, time series data and associated analysis meshes quite nicely too. ClickHouse itself was originally built to support arbitrary analytical queries on clickstream data at pretty massive scale. Cloudflare uses it for live analytics, Uber uses it for logs.

pjot · on June 17, 2023

An over simplification:

Columnar stores are optimized for reads. Row stores are optimized for writes.

anonacct37 · on June 17, 2023

This is an overly simplistic but also correct answer: clickhouse was developed for analytics on clickstreams.

Technically the overall idea is that if you have lots of queries that only read certain columns and your database stores rows contiguously it's a waste to read a whole row and then discard columns.
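A rough sketch of that idea in Python, assuming a made-up three-row events table (the field names `ts`, `url`, `user_id`, and `bytes` are hypothetical, and plain lists stand in for on-disk storage). It only counts how many field values each layout has to touch to sum one column; it is not how ClickHouse is actually implemented.

```python
# Hypothetical toy table; names and values are invented for illustration.
rows = [
    {"ts": 1, "url": "/home", "user_id": 7, "bytes": 512},
    {"ts": 2, "url": "/docs", "user_id": 9, "bytes": 2048},
    {"ts": 3, "url": "/home", "user_id": 7, "bytes": 256},
]


def sum_bytes_row_store(rows):
    """Row layout: each record is stored together, so every field is read."""
    total = 0
    fields_touched = 0
    for row in rows:
        fields_touched += len(row)  # the whole row comes along for the ride
        total += row["bytes"]
    return total, fields_touched


# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_bytes_column_store(columns):
    """Column layout: only the one column the query needs is read."""
    values = columns["bytes"]
    return sum(values), len(values)


print(sum_bytes_row_store(rows))        # (total, field values touched)
print(sum_bytes_column_store(columns))  # same total, fewer values touched
```

Both functions return the same total; the row version touches every field of every record, while the column version touches only the `bytes` values.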

Also compression (such as run length or delta or even zstd) often works better if you give it a block of data that's from one column (such as a timestamp or tag value).
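To make the compression point concrete, here is a minimal sketch of delta and run-length encoding applied to single columns. The sample timestamps and tag values are invented, and real engines use more elaborate codecs; this only shows why a block of values from one column tends to collapse nicely.

```python
def delta_encode(values):
    """Keep the first value, then store the difference between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Hypothetical column blocks: a timestamp column and a tag column.
timestamps = [1686960000, 1686960001, 1686960002, 1686960004]
tags = ["prod", "prod", "prod", "staging"]

print(delta_encode(timestamps))  # one large value, then small deltas
print(run_length_encode(tags))   # repeated tags collapse into runs
```

Interleaving the same values row by row (timestamp, tag, timestamp, tag, ...) would break up both the small deltas and the runs, which is the advantage of compressing one column at a time.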

linuxdude314 · on June 17, 2023

That’s a longer subject than fits in a comment here.

If you are _actually_ interested I suggest using google search to find some good sites that go over what a column oriented database does/is used for.

This isn’t hard; I’ll get you started:

https://www.kdnuggets.com/2021/02/understanding-nosql-databa...

Exuma · on June 17, 2023

Or he, you know, could just ask, because that is the spirit of discussion.

Dachande663 · on June 17, 2023

Cloudflare uses it to ingest 6M requests/s

https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

jgrahamc · on June 17, 2023

Way more than that now.

cplli · on June 17, 2023

Personally tried it; it can handle logs nicely. And from their page, many more things:

https://clickhouse.com/use-cases

craigching · on June 17, 2023

Uber wrote a blog on using Clickhouse to store logs: https://www.uber.com/blog/logging/

esafak · on June 17, 2023

Columnar databases let you do fast aggregations and read only the columns you are interested in. They are for analyzing data.

linuxdude314 · on June 17, 2023

That’s a bit silly. If you don’t know what something is, you can google it pretty easily.

Everyone doesn’t need to cater to the lowest common denominator of knowledge.

KevinChen6 · on June 17, 2023

[flagged]

Exuma · on June 17, 2023

Define "Better"

KevinChen6 · on June 17, 2023

MDX describes data through a multidimensional structure, which brings the semantic model it presents closer to the real business and supports more complex queries on top of that multidimensional model. SQL models can provide similar capabilities, but doing so may be laborious or even extremely difficult when dealing with complex queries. MDX also has a disadvantage compared to SQL: thoroughly understanding the multidimensional data model carries a higher learning cost than understanding the SQL table model.