Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why aren't openxlsx2 formulas evaluated? #863

Open
JanMarvin opened this issue Nov 29, 2023 · 4 comments
Open

Why aren't openxlsx2 formulas evaluated? #863

JanMarvin opened this issue Nov 29, 2023 · 4 comments
Labels
documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on

Comments

@JanMarvin
Copy link
Owner

JanMarvin commented Nov 29, 2023

I have written about this issue already in various places of the openxlsx issue tracker and probably in other places and this issue is just to raise awareness and not something to solve or something I'm going to work on.

The reason why formulas are not evaluated is, because there is nothing to evaluate them.

  • "But Excel shows me numbers if I open a file with formulas?" "Yes."
  • "And for some formulas wb_to_df() also shows values?" "Yes again."

Why is this the case?

The XML file structure of a formula written by us looks like this:

<c ...>
   <f>SUM(A1:B1)</f>
</c>

Here we have a cell with a simple formula: SUM(A1:B1). If A1 is 1 and B1 is 1 the formula will evaluate 2. In your spreadsheet software you can see the value 2. If you save the file to disk and load it again, the XML structure of the cell has changed. Now it looks like this:

<c ...>
   <f>SUM(A1:B1)</f>
   <v>2</v>
</c>

It contains a value field now and this value is 2. Now if we read the value from openxlsx2 we also get a 2. But only because this value is now written into the cell. It is "cached" into the cell. And this might actually become an issue.

If you replace the cell A1 with a value of -1 (wb_add_data(dims = "A1", x = -1). Your formula in cell will still print 2 if read by wb_to_df(), and the formula is obviously not evaluated, because it should show a value of 0.

But why is the value not updated?

That is because, computers programs - in this case your spreadsheet software - need a way to evaluate the formula in the cell with actual data in the spreadsheet. The software needs a way to calculate the formula and it needs a reason to evaluate this formula. The implementation of the formula evaluation is somewhat simple with basic formulas like A1 + A2 and SUM(A1:A2), but it will get tricky if you ever try to implement something like nested VLOOKUP() / CHOOSE() functions, functions where R and your spreadsheet software produce different results and last but not least, someone has to program all/many/most/at least a few of the functions and their quirks into a function we can evaluate in R. The first part is, to actually provide functions that reproduce what the complementary spreadsheet functions do. Since we cannot use a spreadsheet software, we would have to program them ourselves. And I do not want to do this and I do not want to be the one who does this and if you want me to be the one, lets talk about my salary ..., but good news!

You could be the one!

It is not really hard to begin with. First, most likely you do not want EVERY function (this could take a while, because there are a lot), but instead focus on only a few. Second, begin with something easy and not with an overly complex 300 lines of code containing nested INDEX() functions. Instead let us have a look at the basics, like ABS(), SUM() or AVERAGE() and maybe begin with something like this. The idea is get the sheet, reference and operators from the formula and evaluate everything in some R formula just like base R functions.

Draft code

Here I use spreadsheet formulas using + and SUM() and my code matches them with Rs + and sum(). Similar I could use MIN()/MAX() and other functions that have identical names. For AVERAGE() I might need something like average <- function(...) mean(...) and other more complex functions might require actual custom written functions:

library(openxlsx2)

dat <- data.frame(
  num1 = 1,
  num2 = 2,
  fml1 = "A2 + B2",
  fml2 = "SUM(A2, B2)",
  fml3 = "'Sheet 1'!A2 + B2",
  fml4 = "FST + B2"
)
class(dat$fml1) <- "formula"
class(dat$fml2) <- "formula"
class(dat$fml3) <- "formula"
class(dat$fml4) <- "formula"

wb <- wb_workbook()$add_worksheet()$add_data(x = dat)
wb$add_named_region(dims = "A2", name = "FST")
# wb$open()

dat <- data.frame(
  x = 0:10,
  y = -5:5
)
dat_fml <- data.frame(
  x = "SUM(A2:A12)",
  y = "SUM(B2:B12)"
)
class(dat_fml$x) <- "formula"
class(dat_fml$y) <- "formula"

wb$add_worksheet()$add_data(x = dat)$add_data(x = dat_fml, dims = "A14")


wb_eval_excel_fml <- function(wb, sheet, dims) {
  # example function that works with a tiny subset of formulas
  # if you use this function in production you are braver or
  # desparater than you look.
  
  wb <- wb$clone()
  
  openxlsx2:::assert_workbook(wb)
  sheetid <- wb$validate_sheet(sheet)
  
  fmls <- as.character(wb$to_df(dims = dims, sheet = sheetid, show_formula = TRUE, col_names = FALSE))
  
  message("Input formula is: ", fmls)
  tkns <- tidyxl::xlex(fmls)
  
  sel <- tkns$type == "ref"
  vars <- tkns$token[sel]
  rnd <- openxlsx2:::random_string(n = length(vars), pattern = "[a-z]")
  
  sheets <- wb$get_sheet_names()[sheetid]
  sheets <- rep(sheets, length(vars))
  
  if (any(sel2 <- tkns$type == "sheet")) {
    shts <- tkns$token[sel2]
    sel3 <- which(tkns$type == "ref") %in% ( which(tkns$type == "sheet") + 1)
    sheets[sel3] <- stringi::stri_extract_first_regex(shts, "([^']+)")
    
    # remove this from our formula
    tkns <- as.data.frame(tkns)
    tkns <- tkns[!sel2, ]
    
    sel <- tkns$type == "ref"
  }
  
  fml_env <- new.env()
  for (i in seq_along(vars)) {
    num <- wb_to_df(wb, dims = vars[i], sheet = sheets[i], col_names = FALSE)
    assign(rnd[i], num, fml_env)
    tkns$token[sel][i] <- rnd[i]
    message("Var ", vars[i], " is: ", num)
  }
  
  lwr_fmls <- parse(text = tolower(paste0(tkns$token, collapse = "")))
  res <- as.numeric(eval(lwr_fmls, envir = fml_env))
  message("Result: ", res)
  
  sel <- wb$worksheets[[sheetid]]$sheet_data$cc$r == dims
  wb$worksheets[[sheetid]]$sheet_data$cc$v[sel] <- res
 
  wb
}

# wb$open()

wb_to_df(wb)
#>   num1 num2 fml1 fml2 fml3 fml4
#> 2    1    2   NA   NA   NA   NA

wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "C2")
#> Input formula is: A2 + B2
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3

wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "D2")
#> Input formula is: SUM(A2, B2)
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3
wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "E2")
#> Input formula is: 'Sheet 1'!A2 + B2
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3

# # one of many cases that does not work
# wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "F2")

wb <- wb_eval_excel_fml(wb, sheet = 2, dims = "A15")
#> Input formula is: SUM(A2:A12)
#> Var A2:A12 is: c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#> Result: 55
wb <- wb_eval_excel_fml(wb, sheet = 2, dims = "B15")
#> Input formula is: SUM(B2:B12)
#> Var B2:B12 is: c(-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5)
#> Result: 0

wb_to_df(wb, show_formula = TRUE)
#>   num1 num2    fml1        fml2              fml3     fml4
#> 2    1    2 A2 + B2 SUM(A2, B2) 'Sheet 1'!A2 + B2 FST + B2
wb_to_df(wb)
#>   num1 num2 fml1 fml2 fml3 fml4
#> 2    1    2    3    3    3   NA


wb_to_df(wb, sheet = 2, show_formula = TRUE) |> tail()
#>              x           y
#> 10           8           3
#> 11           9           4
#> 12          10           5
#> 13        <NA>        <NA>
#> 14           x           y
#> 15 SUM(A2:A12) SUM(B2:B12)
wb_to_df(wb, sheet = 2) |> tail()
#>       x    y
#> 10    8    3
#> 11    9    4
#> 12   10    5
#> 13 <NA> <NA>
#> 14    x    y
#> 15   55    0
@JanMarvin JanMarvin added documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on labels Nov 29, 2023
@JanMarvin JanMarvin pinned this issue Jan 20, 2024
@SarenT
Copy link

SarenT commented Mar 3, 2025

There is another issue, that the formulas are not recalculated in a saved workbook. Is there any way to mark cells for a recompute or mark the whole document for recompute? Because, when I open a file, make some changes and save it as another file, cached values persist. I can manually recalculate whole document with Ctrl + Shift + F9 (at least in LibreOffice), but this is inconvenient, if these files are meant to be used by someone else. That person may not be aware of the outdated values. Is there a way to remove all cached values?

@JanMarvin
Copy link
Owner Author

JanMarvin commented Mar 4, 2025

You could try something similar to the snippet below. This removes all cached values for a single sheet (it wont handle pivot tables and probably many other things). But for your example, I'd say the case is a little different. In this case, it should be the responsibility of the owner of the output file to make sure that the values are all updated as expected.

Unfortunately even in spreadsheet heavy work environments it is often not possible to make sure that all workbooks are updated to the latest. Someone linked something from another spreadsheet, but in the meantime this spreadsheet has already been updated? This linked spreadsheet is on a netshare (like \\netshare mounted as Z:/) and Excel constantly confuses \\netshare and Z:/? A spreadsheet is linked unintentionally from Outlook temp files? Been there, seen it. And now you bring LibreOffice in the equation, which only supports a subset of the OOXML standard and the many extensions Microsoft has developed for the standard since 2007.

fl <- system.file("extdata", "openxlsx2_example.xlsx", package = "openxlsx2")
wb <- openxlsx2::wb_load(file = fl)

## get cc from worksheet
cc <- wb$worksheets[[1]]$sheet_data$cc
## update cached value
cc[cc$f != "", "v"] <- ""
## add cc to worksheet
wb$worksheets[[1]]$sheet_data$cc <- cc

if (interactive()) wb$open()

@SarenT
Copy link

SarenT commented Mar 4, 2025

Thanks a lot!
Yes, it is a nightmare. That's why I am avoiding all those use cases. I think that the moment you need to link 2 (even both local) spreadsheets, you should switch to something better than spreadsheets.

@JanMarvin
Copy link
Owner Author

:) Unfortunately, often it is not feasible to search for better solutions, after all time is money. But I'm not advocating for spreadsheets. I didn't learn using statistical software just for toying around and I value databases. openxlsx2 is just another (powerful) tool in the toolbox of data analytics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants