Support Board



[User Discussion] - Python for Sierra Chart


[2024-02-13 16:44:01]
User150671 - Posts: 46
If you don't need the convenience of a dataframe, you can avoid the extra overhead by reading into and writing from a np.array directly.
Example:

Read a scid file into a np.array:

import sys
from pathlib import Path

import numpy as np


def get_scid_np(scidFile, limitsize=sys.maxsize):
  # Check Path Exists
  f = Path(scidFile)
  assert f.exists(), "SCID file not found"
  # Set the offset from limitsize (skip the 56-byte header for a full read,
  # otherwise start roughly limitsize bytes from the end, rounded to whole 40-byte records)
  stat = f.stat()
  offset = 56 if stat.st_size < limitsize else stat.st_size - (
    (limitsize // 40) * 40)

  # Set up the SCID dtype (one 40-byte record per row)
  sciddtype = np.dtype([
    ('Time', '<u8'),
    ('Open', '<f4'),
    ('High', '<f4'),
    ('Low', '<f4'),
    ('Close', '<f4'),
    ('Trades', '<i4'),
    ('Volume', '<i4'),
    ('BidVolume', '<i4'),
    ('AskVolume', '<i4')
  ])
  scid = np.fromfile(scidFile, dtype=sciddtype, offset=offset)
  return scid



Write a np.array to a scid file:

def write_scid(scidarray, outfile, mode='w'):
  # Write mode: emit the 56-byte SCID header, then the records
  if mode == 'w':
    with open(outfile, 'wb') as fscid:
      header = b'SCID8\x00\x00\x00(\x00\x00\x00\x01\x00'
      fscid.write(header)
      # Pad the header out to 56 bytes (14 header bytes + 21 * 2 zero bytes)
      for i in range(21):
        fscid.write(b'\x00\x00')
      fscid.write(scidarray.tobytes())

  # Otherwise append the scid array to an existing file
  else:
    with open(outfile, 'ab') as fscid:
      fscid.write(scidarray.tobytes())

  return
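
A minimal round trip with the two functions (the filenames here are just placeholders):

scid = get_scid_np('ESH24.scid')         # structured np.array, one 40-byte record per row
print(scid['Close'][:5])                 # fields are accessed by name
write_scid(scid, 'ESH24_copy.scid')      # 'w' mode writes a fresh 56-byte header first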

Also, to whoever originally wrote this piece of code, which I see is still being used above:
BCOLS = ['datetime', 'open', 'high', 'low', 'close', 'volume', 'trades', 'bidVolume', 'askVolume']

rectype = np.dtype([
  (BCOLS[0], '<u8'), (BCOLS[1], '<f4'), (BCOLS[2], '<f4'),
  (BCOLS[3], '<f4'), (BCOLS[4], '<f4'), (BCOLS[6], '<i4'),
  (BCOLS[5], '<i4'), (BCOLS[7], '<i4'), (BCOLS[8], '<i4')
])


I didn't notice what you had done and just collapsed your bcols and rectype into one variable... because why define something only to then use it to define another variable?
Problem is, the BCOLS order DOES NOT MATCH THE ORDER THE FIELDS ARE DEFINED IN THE DTYPE (volume and trades are switched around). Took me about 40 minutes to figure out how I had broken my scid file generation. Well played, sir.

You can define the np.dtype in one variable as follows:

sciddtype = np.dtype([
  ('Time', '<u8'),
  ('Open', '<f4'),
  ('High', '<f4'),
  ('Low', '<f4'),
  ('Close', '<f4'),
  ('Trades', '<i4'),
  ('Volume', '<i4'),
  ('BidVolume', '<i4'),
  ('AskVolume', '<i4')
])
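
With a single named dtype like this, the field order lives in one place and downstream code addresses fields by name, so a volume/trades swap can't slip in silently:

scid = get_scid_np('ESH24.scid')   # placeholder filename
print(scid['Volume'][:5], scid['Trades'][:5])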

Date Time Of Last Edit: 2024-02-13 16:47:25
[2024-02-13 19:30:19]
rudy5 - Posts: 2
Write a np.array to a scid file:
Nice, thanks. I've been using this to plot my custom stuff: https://github.com/louisnw01/lightweight-charts-python/tree/main
[2024-02-19 22:08:45]
Deiter - Posts: 3
Many thanks to everyone who has contributed to this thread. I have borrowed from your great work and added some Polars (rather than Pandas) code to further my own analysis of Sierra Chart data. I present it here (it's an extract of a Jupyter Notebook that I am running in VS Code) in the hope of repaying some of the goodness y'all have provided.

# Purpose: Getting Sierra Chart .scid file information into a Polars dataframe for analysis and backtesting

# Approach:
# - Use Polars, rather than Pandas, to reap increased speed and large-file handling benefits.
# - Use Numpy to extract binary data from scid file and cast into appropriate data types in a Polars dataframe.
# - Handling of the SCDateTime data is the hardest part, as it cleverly contains both timestamp and trade count information in a single 64-bit integer, so I use some string methods and regex to extract the desired information.

# Outcome Notes:
# - My final dataframe (df_extract) requires further work to get the timestamp info (in column "SCDateTime_ms", signifying that it is an integer of milliseconds) into a Python datatype for time analysis. I also need to account for the cases where the trade counter information maxes out at 999 (see the Reddit discussion referenced below for details).
# - According to the Jupyter Notebook I ran this in, getting to my final dataframe (thus far) took 4.4 seconds for a 41 million row ES futures .scid file.
# - I'm not interested in comparing different approaches to see speed differences (there are many articles online about the speed of vectorized Numpy operations vs loops and Polars vs Pandas), so I don't plan to run a bunch of speed tests. 41M rows in 4.4s is good enough for me right now.

# References:
# - Sierra Chart Support Board thread Python for Sierra Chart
# - Reddit post on reading scid data https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
# - Awesome Sierra Chart documentation Intraday Data File Format
# - The Polars website https://pola.rs/

import sys
from pathlib import Path

import numpy as np
import polars as pl


# The following section is borrowed, with one small change to the data type definition, from the Sierra Chart Support Board referenced above.


def get_scid_np(scidFile, limitsize=sys.maxsize):
    # Check Path Exists
    f = Path(scidFile)
    assert f.exists(), "SCID file not found"

    # set the offset from limitsize
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)

    ### Setup the SCID dtype
    sciddtype = np.dtype(
        [
            ("SCDateTime", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("NumTrades", "<u4"),
            ("TotalVolume", "<u4"),
            ("BidVolume", "<u4"),
            ("AskVolume", "<u4"),
        ]
    )

    scid_as_np_array = np.fromfile(scidFile, dtype=sciddtype, offset=offset)

    return scid_as_np_array

intermediate_np_array = get_scid_np("ESH24_FUT_CME.scid")

df_raw = pl.DataFrame(intermediate_np_array)


# Now convert the SCDateTime 64 bit integer into a string.
df_SCDateTime_as_string = df_raw.with_columns(
    pl.col("SCDateTime").cast(pl.Utf8).alias("SCDateTime_string"),
)


# Trickiest part thus far. Use regex to extract the timestamp info and trade counter info and cast them back into integers for further analysis.
df_extract = df_SCDateTime_as_string.with_columns(
    pl.col("SCDateTime_string")
    .str.extract(r"^.{13}", 0)
    .cast(pl.Int64)
    .alias("SCDateTime_ms"),
    pl.col("SCDateTime_string")
    .str.extract(r".{3}$", 0)
    .cast(pl.Int32)
    .alias("SCDateTime_trade_counter"),
)
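
For comparison, the same split can be done without the string cast, using integer arithmetic directly on the column (a sketch, resting on the same assumption that the last three digits are the per-millisecond trade counter):

# Sketch: split SCDateTime by integer arithmetic instead of regex.
df_extract_int = df_raw.with_columns(
    (pl.col("SCDateTime") // 1000).alias("SCDateTime_ms"),
    (pl.col("SCDateTime") % 1000).alias("SCDateTime_trade_counter"),
)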

Date Time Of Last Edit: 2024-02-19 22:19:11
[2024-02-19 23:15:07]
User150671 - Posts: 46
Handling of the SCDateTime data is the hardest part, as it cleverly contains both timestamp and trade count information in a single 64-bit integer, so I use some string methods and regex to extract the desired information.

I was not tracking that it contains any trade count info whatsoever; ref:
Intraday Data File Format: DateTime

I have the following functions for converting Sierra time integers:

import datetime as dt
import pandas as pd

# Takes a Sierra time integer and returns it as a dt.datetime (UTC)
def unSierraTime(timeint):
  exceldate = -2209161600.0       # Unix timestamp of 1899-12-30 (the Sierra epoch)
  timeint = timeint / 1000000     # microseconds -> seconds
  delta = timeint + exceldate
  delta = dt.datetime.fromtimestamp(delta, tz=dt.timezone.utc)
  return delta

# Takes a Pandas Timestamp or a DatetimeIndex and turns it into a Sierra time integer
def sierraTime(dtg):
  excelDate = (pd.to_datetime('1899-12-30')).tz_localize(tz="utc")
  delta = dtg - excelDate

  if isinstance(dtg, pd.Timestamp):
    delta = int(delta.total_seconds() * 1000000)
  else:
    delta = (delta.total_seconds() * 1000000).astype('int64')

  return delta
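
For example (illustrative timestamp):

ts = pd.Timestamp('2024-02-19 23:15:07', tz='utc')
sc_int = sierraTime(ts)                  # microseconds since 1899-12-30
print(sc_int, unSierraTime(sc_int))      # back to a UTC datetime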

[2024-02-20 16:54:53]
Deiter - Posts: 3
Yeah, I was surprised as well, but I think this is the case, based on three things:
- The page in the SC docs you linked has this section: "When writing tick by tick data to an Intraday data file, where each tick is a single record in the file, and multiple ticks/trades have the same timestamp, then it is acceptable for the Date-Time of the record to be the same timestamp as prior records. What Sierra Chart itself does in this case, is increment the microsecond portion of the Date-Time to create unique timestamps for each trade within a millisecond."
- The Reddit article that first got me thinking about this: https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
- and finally my own read of the .scid file. I have attached a screenshot of the raw SCDateTime data that I imported into my dataframe, showing two runs of tick data where the least significant digit/byte of the data is incrementing by one, while the rest of the SCDateTime is not changing.
Attachment: ES tick by tick SCDateTime extract.png - Attached On 2024-02-20 16:53:46 UTC - Size: 109.09 KB
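
A tiny illustration of that incrementing behavior (the base timestamp below is made up; the real values come straight out of the SCDateTime column):

base_us = 1708444800000 * 1000                            # a millisecond timestamp, in microseconds (illustrative)
same_ms_trades = [base_us + 0, base_us + 1, base_us + 2]  # three trades stamped within the same millisecond
for v in same_ms_trades:
    print(divmod(v, 1000))                                # (millisecond timestamp, per-millisecond trade counter)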
[2024-02-20 23:03:17]
sgne - Posts: 101
One thing I noticed is that if you use a custom chart bar function loaded on a chart, the extracted datetime follows the time zone of the chart, but when you do a direct extract from the .scid file, the time is GMT/UTC.

The C++ commands to use in a {SCCustomChartBarInterfaceRef ChartBarInterface} function could be these:

SCDateTime tTime=ChartBarInterface.NewFileRecord.DateTime;
long long tickTime=reinterpret_cast<long long&>(tTime);
tickTime/=1000; // to get rid of microseconds, which are trade numbers

tTime can be used to get all sorts of data out of the SCDateTime constant reference, e.g.:

double xlTime=tTime.GetAsDouble();


And this is all easily done in AHK (AutoHotkey) as well, with commands like this:

hdr:=56,numBytes:=40
scDateTimePosition:=0
scPricePosition:=12

; scDateTime position (int64) - 0 bytes in
fo.Pos := hdr+(A_Index-1)*numBytes+scDateTimePosition
scDateTime := fo.ReadInt64()//1000 ; get rid of trade number
dtetme := 18991230000000
dtetme += scDateTime//1000,seconds
dte := SubStr(dtetme,1,8)

; scPrice position (float) - 12 bytes in
fo.Pos := hdr+(A_Index-1)*numBytes+scPricePosition
floatPrice := fo.ReadFloat()
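
For the Python users in this thread, the same offsets translate directly, e.g. with struct (a sketch; read_record and the filename are illustrative, the 56-byte header and 40-byte record offsets are the same as above):

import struct

# Sketch: read record n straight from a .scid file - 56-byte header, 40-byte records,
# DateTime as a 64-bit int at byte 0 of each record, followed by the four float price fields.
def read_record(scid_path, n):
    with open(scid_path, 'rb') as f:
        f.seek(56 + n * 40)
        rec = f.read(40)
    sc_datetime, o, h, l, c = struct.unpack('<Q4f', rec[:24])
    return sc_datetime // 1000, o, h, l, c   # drop the trade-number microseconds

print(read_record('ESH24.scid', 0))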
[2024-03-16 20:03:08]
User726340 - Posts: 30
Updated the original function, for users interested in a pandas DataFrame.


import sys
from pathlib import Path

import numpy as np
import pandas as pd


def get_scid_df(filename, limitsize=sys.maxsize):
    f = Path(filename)
    assert f.exists(), f"{f} file not found"
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    sciddtype = np.dtype(
        [
            ("Time", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("Trades", "<i4"),
            ("Volume", "<i4"),
            ("BidVolume", "<i4"),
            ("AskVolume", "<i4"),
        ]
    )
    df = pd.DataFrame(
        data=np.memmap(f, dtype=sciddtype, offset=offset, mode="r"), copy=False
    )
    df.dropna(inplace=True)
    # Shift from the Sierra epoch (1899-12-30) to the Unix epoch, in microseconds
    df["Time"] = df["Time"] - 2209161600000000
    df.drop(df[(df.Time < 1)].index, inplace=True)
    df.set_index("Time", inplace=True)
    df.index = pd.to_datetime(df.index, unit="us")
    df.index = df.index.tz_localize(tz="utc")
    return df


def resample_scdf(scdf_or_filename, period="1Min", tz="UTC", limitsize=sys.maxsize):
    df = (
        get_scid_df(scdf_or_filename, limitsize)
        if isinstance(scdf_or_filename, str)
        else scdf_or_filename
    )
    assert isinstance(
        df, pd.DataFrame
    ), f"{scdf_or_filename} has to be SC df or valid scid file"
    df = df.resample(period).agg(
        {
            "Open": "first",
            "High": "max",
            "Low": "min",
            "Close": "last",
            "Trades": "sum",
            "Volume": "sum",
            "BidVolume": "sum",
            "AskVolume": "sum",
        }
    )
    if tz != "UTC":
        tz = "America/New_York" if tz == "EST" else tz
        tz = "America/Los_Angeles" if tz == "PST" else tz
        df.index = df.index.tz_convert(tz)
    return df

To get a 1Min scid df in EST: print(resample_scdf("ESH24.scid", tz="EST"))
To get a 30Min scid df in EST: print(resample_scdf("ESH24.scid", "30Min", tz="EST"))
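
The limitsize argument works the same way as in the earlier functions: for files larger than limitsize bytes, only roughly the last limitsize bytes of whole 40-byte records are read. For example:

# Read only roughly the last 100 MB of records from a large file, then resample
df_tail = get_scid_df("ESH24.scid", limitsize=100_000_000)
print(resample_scdf(df_tail, "30Min", tz="EST"))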
Date Time Of Last Edit: 2024-03-17 19:36:01
