New DataValues System
Did You Know? The content in this RFC is obsolete and Databend now has a formal type system.
Learn More
Summary
Short comes of current DataType
DataType
is an enum type, we must use specific type after matching. For example, if we want to create deserializer/serializer byDataType
, we should always do matching. It does not mean that match is not necessary. If we want to add more and more functions toDataType
, matching may be very annoyment.DataType
represented as enum type, we can't use it as generic argument.DataType
may involve some nested datatypes, such asDataType::Struct
, but we putDataField
insideDataType
, it's logically unreasonable。Hard to put attributes into enum based
DataType
, such as nullable attribute #3726 #3769
Too many concepts about column (Series/Column/Array)
- DataColumn is an enum, including
Constant(value)
andArray(Series)
pub enum DataColumn {
// Array of values.
Array(Series),
// A Single value.
Constant(DataValue, usize),
}
- Series is a wrap of
SeriesTrait
pub struct Series(pub Arc<dyn SeriesTrait>);
- SeriesTrait can implement various array,using many macros.
pub struct SeriesWrap<T>(pub T);
impl SeriesTrait for SeriesWrap<$da> {
fn data_type(&self) -> &DataType {
self.0.data_type()
}
fn len(&self) -> usize {
self.0.len()
}
...
}
- For functions, we must consider about
Constant
case forColumn
, so there are many branch matching.
match (
columns[0].column().cast_with_type(&DataType::String)?,
columns[1].column().cast_with_type(&DataType::UInt64)?,
) {
(
DataColumn::Constant(DataValue::String(input_string), _),
DataColumn::Constant(DataValue::UInt64(times), _),
) => Ok(DataColumn::Constant(
DataValue::String(repeat(input_string, times)?),
input_rows,
)),
(
DataColumn::Constant(DataValue::String(input_string), _),
DataColumn::Array(times),
)
...
New DataValues system design
Introduce DataType
as a trait
#[typetag::serde(tag = "type")]
pub trait DataType: std::fmt::Debug + Sync + Send + DynClone {
fn data_type_id(&self) -> TypeID;
fn is_nullable(&self) -> bool {
false
}
..
}
Nullable is a special case of DataType
, it's a wrapper of DataType
.
pub struct DataTypeNull {inner: DataTypeImpl}
Simplify DataValue
pub enum DataValue {
/// Base type.
Null,
Boolean(bool),
Int64(i64),
UInt64(u64),
Float64(f64),
String(Vec<u8>),
// Container struct.
Array(Vec<DataValue>),
Struct(Vec<DataValue>),
}
DataValue
can convert into proper DataType
by it's value.
// convert to minialized data type
pub fn data_type(&self) -> DataTypeImpl {
match self {
DataValue::Null => Arc::new(NullType {}),
DataValue::Boolean(_) => BooleanType::new_impl(),
DataValue::Int64(n) => {
if *n >= i8::MIN as i64 && *n <= i8::MAX as i64 {
return Int8Type::new_impl();
}
...
}
Also, DataValue
can convert into rust primitive values and vice versa.
Uniform Series/Array/Column
into Column
Column
as a trait
pub type ColumnRef = Arc<dyn Column>;
pub trait Column: Send + Sync {
fn as_any(&self) -> &dyn Any;
/// Type of data that column contains. It's an underlying physical type:
/// UInt16 for Date, UInt32 for DateTime, so on.
fn data_type_id(&self) -> TypeID {
self.data_type().data_type_id()
}
fn data_type(&self) -> DataTypeImpl;
fn is_nullable(&self) -> bool {
false
}
fn is_const(&self) -> bool {
false
}
..
}
- Introduce
Constant column
Constant column
is a wrapper of aColumn
with a single value(size = 1)#[derive(Clone)]
pub struct ConstColumn {
length: usize,
column: ColumnRef,
}
impl Column for ConstColumn {..}
- Introduce
nullable column
nullable column
is a wrapper of aColumn
and keep an extra bitmap to indicate null values.
pub struct NullableColumn {
validity: Bitmap,
column: ColumnRef,
}
impl Column for NullableColumn {..}
- No extra cost convert from or into Arrow's column format.
fn as_arrow_array(&self) -> common_arrow::ArrayRef {
let data_type = self.data_type().arrow_type();
Arc::new(PrimitiveArray::<T>::from_data(
data_type,
self.values.clone(),
None,
))
}
- Keep
Series
as a tool struct, this may help to fast generate a column.
// nullable column from options
let column = Series::from_data(vec![Some(1i8), None, Some(3), Some(4), Some(5)]);
// no nullable column
let column = Series::from_data(vec![1,2,3,4);
- Downcast into the specific
Column
impl Series {
/// Get a pointer to the underlying data of this Series.
/// Can be useful for fast comparisons.
/// # Safety
/// Assumes that the `column` is T.
pub unsafe fn static_cast<T>(column: &ColumnRef) -> &T {
let object = column.as_ref();
&*(object as *const dyn Column as *const T)
}
pub fn check_get<T: 'static + Column>(column: &ColumnRef) -> Result<&T> {
let arr = column.as_any().downcast_ref::<T>().ok_or_else(|| {
ErrorCode::UnknownColumn(format!(
"downcast column error, column type: {:?}",
column.data_type()
))
});
arr
}
}
- Convenient way to view a column by
ColumnViewer
No need to care about Constants
and Nullable
.
let wrapper = ColumnViewer::<i8>::try_create(&column)?;
assert_eq!(wrapper.len(), 10);
assert!(!wrapper.null_at(0));
for i in 0..wrapper.len() {
assert_eq!(*wrapper.value(i), (i + 1) as i8);
}
Ok(())
let wrapper = ColumnViewer::<bool>::try_create(&column)?;
let c = wrapper.value(0);
let wrapper = ColumnViewer::<&str>::try_create(&column)?;
let c = wrapper.value(1);
Ok(())
TODO
- Make
datavalues2
more mature. - Merge
datavalues2
into Databend.