Capture/lump a new "Others" group when applying limits #29350
mistercrunch
started this conversation in
Ideas
Replies: 1 comment 1 reply
-
@dosu any updates on that matter? When a user set series limit inside chart, all other data are not represented. Is it possible to see them grouped by "others" category? |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
In a recent discussion between @john-bodley @michael-s-molina @villebro @mistercrunch and @rusackas about handling/labeling the "Totals" fields for non-additive metrics in the "table" visualization, we all seemed to agree strongly that a new feature around "lumping" or "creating an extra aggregate for everything outside the limit" would be a very useful feature and help making "Total" or. "Summary" row align with the content of the table.
I'm opening the discussion here on GitHub to summarize what we spoke about, open thus topic to all, and discuss the intricacies around all this.
For the purpose of this conversation, let's call this family of features "lumping" as a way to get a full picture of the dataset while limiting the level of details.
Exposing this feature/option to users
It seems that wherever limits are set, an extra checkbox would be required. Labeling could be something like
Include 'Other' bucket
orAggregate Remaining Data
(good naming might be difficult here).SQL / compute implications
There are multiple ways to do this, but one approach would using a
LEFT JOIN
to a limited subset, resulting in a single scan and proper grouping in the outer query.Note that some databases may not support CTEs or subqueries, in which case the feature could either be unsupported, or we could run something similar in multiple phases, though that may get messy. Picture a LIMIT 1000 on multiple dimensions (say country & region, that makes for a nasty looking second phase with 1000
OR
predicates (WHERE (country='Canada' AND region='Quebec') OR (country='Canada' and region='Alberta', {+998 other OR clauses}
). Maybe this feature is only supported with databases that support subqueries.Other potential approaches:
Seems the LEFT JOIN + COALESCE is probably most efficient.
Labeling lumped groups
Maybe simply using "Others" here may work, or going more specific with "Other {dimension_label}". As far as the backend goes, we should be cautious around collision as the dimension may already contain an
Other
member and should be treated as a normal member (it shouldn't collide)Relationship to "Series Limit"
Series limit is an interesting feature where we limit the number of time series as opposed to limiting the number of rows, the typical example is showing a line chart of population by country, but only for the top N countries. Clearly here showing an extra timeseries line for "other countries" would be useful, but is a bit of an edge case since we want to lump "other countries", but preserve the time component of it. In any case, I wanted to point out that while the user might check the same checkbox labeled
Aggregate Remaining Data
in the context of a time series, the handling behind the scene might be different as to how it's computed on the backend.Note that handling lumping at the "row-level" (while grouping on dimension+time_grain) for time series doesn't really make sense. It does absolutely make sense at the series level though.
The handling of lumping for timeseries should be different than it is for non-time-series.
Other considerations and challenges:
country
andmarket_category
I'm getting the top 100 combination of those, and I can't get a total for any given country or market_category. Allowing for top 20 coutnries an top 15 market_category independently could be a desirable option in some casesBeta Was this translation helpful? Give feedback.
All reactions